© This material is licensed under a Creative Commons CC BY-NC-ND license, which allows the work to be downloaded and shared with others as long as it is credited, but it may not be modified in any way or used commercially.

1 Introduction

Price research within the housing market stands as a pivotal compass guiding numerous stakeholders – buyers, sellers, investors and policymakers. The dynamics of property prices encapsulate a multitude of intricate factors, dictating not just market trends but also reflecting the societal, economic and geographical characteristics of a region. In this vein, the Ames Housing Dataset assumes significance as a comprehensive repository, offering a nuanced lens into the complex tapestry of housing attributes and sale prices in Ames, Iowa.

The study analyzes a subset of the original dataset. It is important to underscore that while the chosen variables offer a glimpse into critical housing attributes, their selection was driven primarily by practical constraints of time and feasibility. The chosen variables predominantly orbit around the intrinsic aspects of properties – encompassing dimensions such as property quality, condition, size measurements and temporal data. This focus allows us to unravel the essence of housing valuations through an inspection of individual property features, steering the analytical spotlight away from broader neighborhood influences.

In pursuit of this understanding, this study harnesses a blend of univariate and multivariate analytical techniques within the R programming environment. The numerical analyses are accompanied by graphical visualizations.

The initial phase of the analysis involves a univariate examination of the chosen variables. This step encompasses the scrutiny of missing data, classical numeric descriptive analysis, outlier detection and the assessment of normality assumptions. This step-by-step dissection offers a foundational understanding of the individual behaviors and distributions of these critical housing attributes.

Subsequently, the study transitions into a multivariate analysis, delving deeper into the complex interplay of these chosen variables. Employing correlation studies, the aim is to unveil intricate relationships and potential interdependencies among these property-specific attributes and the ultimate sale prices. Dimensionality reduction techniques, such as Principal Component Analysis (PCA), will be employed to distill the essence of these variables, potentially uncovering latent factors that collectively contribute to housing valuations. Additionally, we consider employing clustering algorithms to identify possible groupings or patterns within the dataset.

The core objective of this research is to deeply understand the nuanced relationships between property attributes and sale prices. By prioritizing this understanding, the study aims to reveal the most significant variables, i.e. those attributes that exhibit the strongest association with or influence on housing sale prices within the local Ames real estate market. We also aim to construct classification methods in order to predict housing sale prices.

Loading/installation of R packages necessary for this practice.

The following source code module loads all the packages that will be used in this R session, provided they are already installed.

#########################################
# Loading necessary packages and reason #
#########################################

# Package required to call 'ggplot' function
library(ggplot2)

# Package required to call 'ggarrange' function
library(ggpubr)

# Package required to call 'scatterplot3d' function
library(scatterplot3d)

# Package required to call 'melt' function
library(reshape2)

# Package required to call 'boxM' function
library(biotools)

# Package required to call 'partimat' function
library(klaR)

# Package required to call 'summarise' function
library(dplyr)

# Package required to call 'createDataPartition' function
library(caret)

# Package required to call 'freq' and 'descr' functions (descriptive statistics)
library(summarytools)

# Package required to call 'read.spss' function (loading '.spss' data format)
library(foreign)

# Package required to call 'read_xlsx' function (loading '.xlsx' data format)
library(readxl)

# Package required to load the data set 'RBGlass1'
library(archdata)

# Package required to call 'cortest.bartlett' function
library(psych)

# Package required to call 'fviz_pca_var', 'fviz_pca_ind', 'fviz_pca', 'get_dist', 'fviz_cluster' and 'fviz_dist' functions
library(factoextra)

# Package required to call 'hetcor' function
library(polycor)

# Package required to call 'ggcorrplot' function
library(ggcorrplot)

# Package required to call 'corrplot' function
library(corrplot)

# Package required to call 'rplot' function
library(corrr)

# Package required to call 'mvn' function
library(MVN)

# Package required to plot ROC curve
library(pROC)

# Package required to call 'mutate' function
library(tidyverse)

# Package required to call 'clusGap' function
library(cluster)

# Package required to call 'ggdendrogram' function
library(ggdendro)

# Package required to call 'grid.arrange' function
library(gridExtra)

# Package required to call 'Mclust' function (model-based clustering)
library(mclust)

# Package required to call 'skewness' and 'kurtosis' functions
library(moments)
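
The explicit library() calls above assume every package is already installed. As a complement, the following is a minimal, hedged sketch of a loader that installs any missing packages from CRAN before attaching them; the vector of names simply mirrors the packages used in this session.

# Sketch: install any missing packages, then attach them all (assumes CRAN access)
pkgs <- c("ggplot2", "ggpubr", "scatterplot3d", "reshape2", "biotools", "klaR",
          "dplyr", "caret", "summarytools", "foreign", "readxl", "archdata",
          "psych", "factoextra", "polycor", "ggcorrplot", "corrplot", "corrr",
          "MVN", "pROC", "tidyverse", "cluster", "ggdendro", "gridExtra",
          "mclust", "moments")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
invisible(lapply(pkgs, library, character.only = TRUE))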

In this chunk we will load the dataset.

# Loading the .csv file
# The output of this function is already a data.frame object
data<-read.csv("train.csv", header = TRUE,sep =",", fileEncoding = "UTF-8")

2 Univariate exploratory analysis

In this phase we carry out a preliminary exploratory analysis of the data. To do this, we apply different numerical and graphical techniques used in class. At first, we focus on the analysis of each variable independently, without yet looking for possible interactions between them, in order to examine data groupings, missing values, classical numeric descriptive statistics, extreme values and the assumption of normality.

# This call attaches the data.frame so that its variables
# can be accessed by name without referring to the data.frame identifier
attach(data)

# We will only select 17 variables for our study
data <- data[, c(
  "MSSubClass",
  "MSZoning",
  "LotFrontage",
  "LotArea",
  "OverallQual",
  "OverallCond",
  "YearBuilt",
  "ExterQual",
  "ExterCond",
  "GarageQual",
  "GrLivArea",
  "X1stFlrSF",
  "X2ndFlrSF",
  "YrSold",
  "SaleType",
  "SaleCondition",
  "SalePrice"
)]


# Retrieving the name of the selected variables
colnames(data)
##  [1] "MSSubClass"    "MSZoning"      "LotFrontage"   "LotArea"      
##  [5] "OverallQual"   "OverallCond"   "YearBuilt"     "ExterQual"    
##  [9] "ExterCond"     "GarageQual"    "GrLivArea"     "X1stFlrSF"    
## [13] "X2ndFlrSF"     "YrSold"        "SaleType"      "SaleCondition"
## [17] "SalePrice"
# Displaying a few records
head(data)

Now we show the proportion of missing values per column.

# Proportion of NA values per column
na_counts <- colMeans(is.na(data))

print(na_counts)
##    MSSubClass      MSZoning   LotFrontage       LotArea   OverallQual 
##    0.00000000    0.00000000    0.17739726    0.00000000    0.00000000 
##   OverallCond     YearBuilt     ExterQual     ExterCond    GarageQual 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.05547945 
##     GrLivArea     X1stFlrSF     X2ndFlrSF        YrSold      SaleType 
##    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
## SaleCondition     SalePrice 
##    0.00000000    0.00000000
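
For completeness, absolute counts of missing values (rather than proportions) can be obtained with colSums; a minimal sketch:

# Number of NA values per column (counts rather than proportions)
colSums(is.na(data))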

2.1 MSSubClass

It identifies the type of dwelling involved in the sale. (Discrete Numerical)

2.1.1 Numeric descriptive analysis

# Classical numeric descriptive analysis
describe(MSSubClass)
freq(MSSubClass)
## Frequencies  
## MSSubClass  
## Type: Integer  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          20    536     36.71          36.71     36.71          36.71
##          30     69      4.73          41.44      4.73          41.44
##          40      4      0.27          41.71      0.27          41.71
##          45     12      0.82          42.53      0.82          42.53
##          50    144      9.86          52.40      9.86          52.40
##          60    299     20.48          72.88     20.48          72.88
##          70     60      4.11          76.99      4.11          76.99
##          75     16      1.10          78.08      1.10          78.08
##          80     58      3.97          82.05      3.97          82.05
##          85     20      1.37          83.42      1.37          83.42
##          90     52      3.56          86.99      3.56          86.99
##         120     87      5.96          92.95      5.96          92.95
##         160     63      4.32          97.26      4.32          97.26
##         180     10      0.68          97.95      0.68          97.95
##         190     30      2.05         100.00      2.05         100.00
##        <NA>      0                               0.00         100.00
##       Total   1460    100.00         100.00    100.00         100.00
# Histogram and density plots
p1<-ggplot(data,aes(x=MSSubClass))+geom_histogram()+
  labs(title = "Histogram of MSSubClass",x="MSSubClass",y="Values")


p1

2.1.2 Missing values

# Missing Values
na_counts <- sum(is.na(data$MSSubClass))

print(na_counts)
## [1] 0

There are no missing values.

2.1.3 Outliers

# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = MSSubClass)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
  coord_flip() +
  labs(title = "Boxplot of MSSubClass", x = "Values", y = "")

# Count outliers
outliers <- boxplot.stats(data$MSSubClass)$out

# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 103
# Identify rows with outliers
outlier_rows <- which(data$MSSubClass %in% outliers)

# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 10 49 57 76 88 94 116 126 146 166 173 181 194 196 226 228 233 236 244 247 286 292 301 313 336 345 349 364 412 431 433 435 473 489 490 491 501 505 521 536 579 600 604 615 624 636 638 650 656 676 686 688 704 706 714 756 759 830 832 838 862 915 916 957 960 963 970 972 976 986 1008 1030 1031 1039 1040 1063 1069 1087 1089 1092 1105 1145 1161 1173 1187 1191 1192 1220 1237 1266 1267 1292 1298 1305 1335 1359 1365 1368 1379 1394 1417 1450 1453
p1

There are 103 outliers; since there are enough data records, we will eliminate them.

# Remove rows with outliers
data <- data[-outlier_rows, ]

# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1357
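
For reference, boxplot.stats flags the values lying beyond the whiskers, which corresponds approximately to the classical 1.5 * IQR rule (boxplot.stats uses hinges, so the result can differ slightly from quantile()). A minimal sketch of that rule, shown only as an illustration and not as part of the original workflow:

# Illustrative sketch of the 1.5 * IQR rule underlying boxplot.stats
iqr_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x[x < q[1] - coef * diff(q) | x > q[2] + coef * diff(q)]
}

Applied to a numeric column such as data$MSSubClass, it returns the values that would be flagged as outliers.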

2.1.4 Assumption of Normality

It isn’t a continuous variable, so we skip this step.

2.2 MSZoning

It identifies the general zoning classification of the sale. (Categorical variable)

2.2.1 Numeric descriptive analysis

# Basic descriptive statistics
freq(MSZoning)
## Frequencies  
## MSZoning  
## Type: Character  
## 
##                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
##       C (all)     10      0.68           0.68      0.68           0.68
##            FV     65      4.45           5.14      4.45           5.14
##            RH     16      1.10           6.23      1.10           6.23
##            RL   1151     78.84          85.07     78.84          85.07
##            RM    218     14.93         100.00     14.93         100.00
##          <NA>      0                               0.00         100.00
##         Total   1460    100.00         100.00    100.00         100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = MSZoning))+geom_bar()+
  coord_polar("y")+labs(x="MSZoning",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = MSZoning))+geom_bar()+
  labs(x="MSZoning",y="%")

# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)

There aren’t any missing values, so we will continue with the next variable.

2.3 LotFrontage

Linear feet of street connected to property

2.3.1 Numeric descriptive analysis

# Classical numeric descriptive analysis
describe(LotFrontage)
# Histogram and density plots
p1<-ggplot(data,aes(x=LotFrontage))+geom_density()+
  labs(title = "Density of LotFrontage",x="LotFrontage",y="Values")

p2<-ggplot(data,aes(x=LotFrontage))+geom_histogram()+
  labs(title = "Histogram of LotFrontage",x="LotFrontage",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

Missing values

# Missing Values

na_counts <- sum(is.na(data$LotFrontage))

print(na_counts)
## [1] 248
print(na_counts/nrow(data))
## [1] 0.1827561

248 missing values -> 18.28%

It is a continuous variable, so we will analyze whether the missingness pattern is random using a Student’s t-test.

2.3.1.1 Random pattern

We analyze whether the missingness pattern is random. In order to do this, we study the homogeneity between the NA and non-NA groups with a t-test.

Null Hypothesis (H0): The null hypothesis in a t-test is that there is no significant difference between the means of the two groups. Alternative Hypothesis (H1): The alternative hypothesis is that the true difference in means is not equal to 0.

# Create a subset for non-missing values
non_missing_values <- data$LotFrontage[!is.na(data$LotFrontage)]

# Perform a t-test for continuous variables
result <- t.test(non_missing_values, data$LotFrontage, alternative = "two.sided")
print(paste("Variable: LotFrontage"))
## [1] "Variable: LotFrontage"
print(result)
## 
##  Welch Two Sample t-test
## 
## data:  non_missing_values and data$LotFrontage
## t = 0, df = 2216, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.890914  1.890914
## sample estimates:
## mean of x mean of y 
##  72.62489  72.62489

t-statistic: The t-value is 0. The t-value measures the difference between the means of the two groups relative to the variability in the data. A t-value of 0 suggests that there is no significant difference between the means.

Degrees of Freedom (df): The degrees of freedom for the Welch Two Sample t-test is 2216.

p-value: The p-value is 1. This is the probability of observing a t-statistic as extreme as the one computed from the sample, assuming that the null hypothesis is true. A p-value of 1 suggests that there is no evidence to reject the null hypothesis.

Confidence Interval: The 95 percent confidence interval for the difference in means is from -1.890914 to 1.890914. This interval contains 0, further supporting the idea that there is no significant difference between the means.

Interpretation: In summary, based on the provided p-value (1), we fail to reject the null hypothesis. There is no significant evidence to suggest that the means of the two groups (missing and non-missing values of LotFrontage) are different.
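
Note that t.test drops NA values, so the two samples above share exactly the same non-missing observations, which is why the difference is exactly zero. A complementary, hedged check (not part of the original workflow) is to compare an observed variable, here SalePrice purely as an illustration, between the records with and without LotFrontage:

# Hedged illustration: compare SalePrice between rows with and without LotFrontage
miss_lf <- is.na(data$LotFrontage)
t.test(data$SalePrice[miss_lf], data$SalePrice[!miss_lf])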

Given this homogeneity, the missingness pattern is treated as random and, in this scenario, we choose to replace the NAs with the mean.

# Calculate the mean of non-missing values
mean_lot_frontage <- mean(data$LotFrontage, na.rm = TRUE)

# Replace missing values with the mean
data$LotFrontage[is.na(data$LotFrontage)] <- mean_lot_frontage

na_counts <- sum(is.na(data$LotFrontage))

print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0

2.3.2 Final numeric descriptive analysis

describe(data$LotFrontage)
# Histogram and density plots
p1<-ggplot(data,aes(x=data$LotFrontage))+geom_density()+
  labs(title = "Density of LotFrontage",x="LotFrontage",y="Values")

p2<-ggplot(data,aes(x=data$LotFrontage))+geom_histogram()+
  labs(title = "Histogram of LotFrontage",x="LotFrontage",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

2.3.3 Outliers

# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = LotFrontage)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
  coord_flip() +
  labs(title = "Boxplot of LotFrontage", x = "Values", y = "")

ggarrange(p1, nrow=1, common.legend = FALSE)

# Count outliers
outliers <- boxplot.stats(data$LotFrontage)$out

# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 47
# Identify rows with outliers
outlier_rows <- which(data$LotFrontage %in% outliers)

# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 36 41 82 83 142 151 162 184 216 241 257 258 270 280 284 290 297 401 415 485 614 751 769 829 846 849 872 898 901 922 940 987 1026 1027 1047 1070 1086 1087 1090 1099 1125 1183 1206 1243 1245 1264 1266

There are 47 outliers; since there are enough data records, we will eliminate them.

# Remove rows with outliers
data <- data[-outlier_rows, ]

# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1310

2.3.4 Assumption of Normality

# Histogram representation of the column LotFrontage
par(mfcol = c(1, 1))

# Define the column name
j0 <- "LotFrontage"

# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)

# Create a histogram for the entire column
hist(data[, j0], prob = TRUE, col = grey(0.8), main = paste("Histogram of", j0), xlab = j0)

# Overlay normal distribution curve
lines(x0, dnorm(x0, mean(data[, j0], na.rm = TRUE), sd(data[, j0], na.rm = TRUE)), col = "blue", lwd = 2)

2.3.4.1 Qqplot graphic

# Representation of normal quantiles for the column LotFrontage
par(mfrow = c(1, 1))

# Define the column name
j0 <- "LotFrontage"

# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)

# Create a quantile-quantile plot
qqnorm(data[, j0], main = paste("Q-Q Plot of", j0), pch = 19, col = 2)
qqline(data[, j0])

This exploratory analysis can give us an idea of the possible normal distribution of the univariate variables, but it is always better to do the respective normality tests.

2.3.4.2 Univariate normality test (Shapiro-Wilk)

The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilk test is less than 0.05. Otherwise the assumption of normality of the data is not rejected.

data_tidy <- melt(data, value.name = "value")

result <- aggregate(value ~ variable, data = data_tidy,
                    FUN = function(x) shapiro.test(x)$p.value)

# Print the row corresponding to LotFrontage
print(subset(result, variable == "LotFrontage"))
##      variable        value
## 2 LotFrontage 2.510955e-12

There is evidence of lack of univariate normality (p-value < 0.05).
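
Equivalently, the Shapiro-Wilk test can be applied directly to the single column without reshaping the whole data frame; a minimal sketch:

# Direct Shapiro-Wilk test on LotFrontage (same test as the row printed above)
shapiro.test(data$LotFrontage)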

2.4 LotArea

Lot size in square feet

2.4.1 Numeric descriptive analysis

# Classical numeric descriptive analysis
describe(LotArea)
# Histogram and density plots
p1<-ggplot(data,aes(x=LotArea))+geom_density()+
  labs(title = "Density of LotArea",x="LotArea",y="Values")

p2<-ggplot(data,aes(x=LotArea))+geom_histogram()+
  labs(title = "Histogram of LotArea",x="LotArea",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

2.4.2 Missing values

# Missing Values

na_counts <- sum(is.na(data$LotArea))

print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0

0 missing values -> 0%

2.4.3 Outliers

# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = LotArea)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
  coord_flip() +
  labs(title = "Boxplot of LotArea", x = "Values", y = "")

ggarrange(p1, nrow=1, common.legend = FALSE)

# Count outliers
outliers <- boxplot.stats(data$LotArea)$out

# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 59
# Identify rows with outliers
outlier_rows <- which(data$LotArea %in% outliers)

# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 50 62 104 110 114 161 167 221 242 268 305 306 320 327 340 372 377 401 407 471 485 499 504 528 590 593 595 597 619 620 632 651 692 749 766 772 793 829 850 854 943 952 1061 1102 1124 1131 1139 1145 1155 1174 1207 1238 1250 1263 1276 1283 1287 1299 1304

There are 59 outliers; since there are enough data records, we will eliminate them.

# Remove rows with outliers
data <- data[-outlier_rows, ]

# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1251

2.4.4 Assumption of Normality

# Histogram representation of the column LotArea
par(mfcol = c(1, 1))

# Define the column name
j0 <- "LotArea"

# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)

# Create a histogram for the entire column
hist(data[, j0], prob = TRUE, col = grey(0.8), main = paste("Histogram of", j0), xlab = j0)

# Overlay normal distribution curve
lines(x0, dnorm(x0, mean(data[, j0], na.rm = TRUE), sd(data[, j0], na.rm = TRUE)), col = "blue", lwd = 2)

2.4.4.1 Qqplot graphic

# Representation of normal quantiles for the column LotArea
par(mfrow = c(1, 1))

# Define the column name
j0 <- "LotArea"

# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)

# Create a quantile-quantile plot
qqnorm(data[, j0], main = paste("Q-Q Plot of", j0), pch = 19, col = 2)
qqline(data[, j0])

This exploratory analysis can give us an idea of the possible normal distribution of the univariate variables, but it is always better to do the respective normality tests.

2.4.4.2 Univariate normality test (Shapiro-Wilk)

The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilk test is less than 0.05. Otherwise the assumption of normality of the data is not rejected.

data_tidy <- melt(data, value.name = "value")

# Use the aggregate function with the correct FUN argument
result <- aggregate(value ~ variable, data = data_tidy,
                    FUN = function(x) shapiro.test(x)$p.value)

# Print the row corresponding to LotArea
print(subset(result, variable == "LotArea"))
##   variable       value
## 3  LotArea 7.59414e-05

There is evidence of lack of univariate normality (p-value < 0.05).

2.5 OverallQual

Rates the overall material and finish of the house (1 - Very Poor , 10 - Very Excellent) (Ordinal Numerical Discrete)

2.5.1 Numeric descriptive analysis

# Classical numeric descriptive analysis
describe(OverallQual)
# Histogram and density plots
p1<-ggplot(data,aes(x=OverallQual))+geom_density()+
  labs(title = "Density of OverallQual",x="OverallQual",y="Values")

p2<-ggplot(data,aes(x=OverallQual))+geom_histogram()+
  labs(title = "Histogram of OverallQual",x="OverallQual",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

2.5.2 Missing values

# Missing Values

na_counts <- sum(is.na(data$OverallQual))

print(na_counts)
## [1] 0

No missing values.

2.5.3 Outliers

# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = OverallQual)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
  coord_flip() +
  labs(title = "Boxplot of OverallQual", x = "Values", y = "")

ggarrange(p1, nrow=1, common.legend = FALSE)

# Count outliers
outliers <- boxplot.stats(data$OverallQual)$out

# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 2
# Identify rows with outliers
outlier_rows <- which(data$OverallQual %in% outliers)

# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 317 455

There are 2 outliers; since there are enough data records, we will eliminate them.

# Remove rows with outliers
data <- data[-outlier_rows, ]

# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1249

2.5.4 Assumption of Normality

It isn’t a continuous variable so we skip this step.

2.6 OverallCond

Rates the overall condition of the house (1 - Very Poor , 10 - Very Excellent)

2.6.1 Numeric descriptive analysis

# Classical numeric descriptive analysis
describe(OverallCond)
# Histogram and density plots
p1<-ggplot(data,aes(x=OverallCond))+geom_density()+
  labs(title = "Density of OverallCond",x="OverallCond",y="Values")

p2<-ggplot(data,aes(x=OverallCond))+geom_histogram()+
  labs(title = "Histogram of OverallCond",x="OverallCond",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

2.6.2 Missing values

# Missing Values

na_counts <- sum(is.na(data$OverallCond))

print(na_counts)
## [1] 0

No missing values.

2.6.3 Outliers

# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = OverallCond)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
  coord_flip() +
  labs(title = "Boxplot of OverallCond", x = "Values", y = "")

ggarrange(p1, nrow=1, common.legend = FALSE)

# Count outliers
outliers <- boxplot.stats(data$OverallCond)$out

# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 106
# Identify rows with outliers
outlier_rows <- which(data$OverallCond %in% outliers)

# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 2 15 24 78 81 84 102 160 165 166 190 208 214 230 256 273 294 298 319 324 326 330 338 342 350 362 369 372 382 391 411 420 428 432 443 452 484 498 506 519 531 545 576 577 594 606 608 632 636 640 645 677 722 723 726 735 740 757 769 789 791 812 819 821 836 841 846 851 859 864 865 886 890 894 920 924 938 964 965 971 978 984 989 992 994 995 1008 1011 1021 1031 1041 1050 1052 1054 1081 1083 1089 1132 1139 1160 1182 1186 1196 1209 1229 1247

There are 106 outliers; since there are enough data records, we will eliminate them.

# Remove rows with outliers
data <- data[-outlier_rows, ]

# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1143

2.6.4 Assumption of Normality

It isn’t a continuous variable so we skip this step.

2.7 YearBuilt

Original construction date

2.7.1 Numeric descriptive analysis

# Classical numeric descriptive analysis
describe(YearBuilt)
# Histogram and density plots
p1<-ggplot(data,aes(x=YearBuilt))+geom_density()+
  labs(title = "Density of YearBuilt",x="YearBuilt",y="Values")

p2<-ggplot(data,aes(x=YearBuilt))+geom_histogram()+
  labs(title = "Histogram of YearBuilt",x="YearBuilt",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

2.7.2 Missing values

# Missing Values

na_counts <- sum(is.na(data$YearBuilt))

print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0

0 missing values

2.7.3 Outliers

# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = YearBuilt)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
  coord_flip() +
  labs(title = "Boxplot of YearBuilt", x = "Values", y = "")

ggarrange(p1, nrow=1, common.legend = FALSE)

# Count outliers
outliers <- boxplot.stats(data$YearBuilt)$out

# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 5
# Identify rows with outliers
outlier_rows <- which(data$YearBuilt %in% outliers)

# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 89 499 589 893 1058

There are 5 outliers; since there are enough data records, we will eliminate them.

# Remove rows with outliers
data <- data[-outlier_rows, ]

# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1138

2.7.4 Assumption of Normality

It isn’t a continuous variable so we skip this step.

2.8 ExterQual

Evaluates the quality of the material on the exterior

2.8.1 Numeric descriptive analysis

# Basic descriptive statistics
freq(ExterQual)
## Frequencies  
## ExterQual  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          Ex     52      3.56           3.56      3.56           3.56
##          Fa     14      0.96           4.52      0.96           4.52
##          Gd    488     33.42          37.95     33.42          37.95
##          TA    906     62.05         100.00     62.05         100.00
##        <NA>      0                               0.00         100.00
##       Total   1460    100.00         100.00    100.00         100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = ExterQual))+geom_bar()+
  coord_polar("y")+labs(x="ExterQual",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = ExterQual))+geom_bar()+
  labs(x="ExterQual",y="%")

# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)

There aren’t any missing values, so we will continue with the next variable.

2.9 ExterCond

Evaluates the present condition of the material on the exterior

2.9.1 Numeric descriptive analysis

# Basic descriptive statistics
freq(ExterCond)
## Frequencies  
## ExterCond  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          Ex      3     0.205          0.205     0.205          0.205
##          Fa     28     1.918          2.123     1.918          2.123
##          Gd    146    10.000         12.123    10.000         12.123
##          Po      1     0.068         12.192     0.068         12.192
##          TA   1282    87.808        100.000    87.808        100.000
##        <NA>      0                              0.000        100.000
##       Total   1460   100.000        100.000   100.000        100.000
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = ExterCond))+geom_bar()+
  coord_polar("y")+labs(x="ExterCond",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = ExterCond))+geom_bar()+
  labs(x="ExterCond",y="%")

# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)

There aren’t any missing values, so we will continue with the next variable.

2.10 GarageQual

Garage quality

2.10.1 Numeric descriptive analysis

# Basic descriptive statistics
freq(GarageQual)
## Frequencies  
## GarageQual  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##          Ex      3      0.22           0.22      0.21           0.21
##          Fa     48      3.48           3.70      3.29           3.49
##          Gd     14      1.02           4.71      0.96           4.45
##          Po      3      0.22           4.93      0.21           4.66
##          TA   1311     95.07         100.00     89.79          94.45
##        <NA>     81                               5.55         100.00
##       Total   1460    100.00         100.00    100.00         100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = GarageQual))+geom_bar()+
  coord_polar("y")+labs(x="GarageQual",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = GarageQual))+geom_bar()+
  labs(x="GarageQual",y="%")

# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)

NA values in this variable mean that there isn’t a garage, so we don’t have to impute them. In order to apply the classification methods correctly later on, we transform this variable into a numerical ordinal encoding.

# Create a mapping for the transformation
garage_qual_mapping <- c(TA = 3, Po = 1, Gd = 4, Fa = 2, Ex = 5)

# Transform GarageQual variable
data <- data %>% 
  mutate(GarageQual_numeric = ifelse(is.na(GarageQual), 0, garage_qual_mapping[as.character(GarageQual)]))

# Drop the GarageQual variable
data <- select(data, -GarageQual)

# Check the transformed data
print(data$GarageQual_numeric)
##    [1] 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 0 3 3
##   [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3
##   [75] 0 3 3 3 3 3 2 0 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3
##  [112] 3 3 3 3 0 3 3 3 3 2 3 0 2 3 3 3 2 0 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 2
##  [149] 2 3 3 3 2 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [186] 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 5 3 3 3 3 3 3
##  [223] 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3
##  [260] 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [297] 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [334] 3 3 3 3 3 0 3 3 3 4 3 0 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 0 3 3 3 3 3 3 3
##  [371] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 2 3 3 3 3 3 3 3 3 1 3 3 3 3
##  [408] 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3
##  [445] 3 3 3 3 3 3 3 3 3 3 3 2 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [482] 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3
##  [519] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [556] 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 0 3 3 3 3
##  [593] 3 3 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 2 3 3 3 3 3 3 3
##  [630] 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3
##  [667] 4 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [704] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3
##  [741] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 0 3 3 3 3 0 0 3 3 0 3 3 3 3 3 3 3 3
##  [778] 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 0 0 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 2
##  [815] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [852] 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3
##  [889] 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 2 3 3 3 3 3 3 3 3 2 0 3 3 3 3
##  [926] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 0
##  [963] 2 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3
## [1000] 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 0 3 0 0
## [1037] 2 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [1074] 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 0 3 3 3 3 3 3 3 3 3 3 3 3
## [1111] 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 2 0 3 3 3 3 3 3
# Classical numeric descriptive analysis
describe(data$GarageQual_numeric)
# Histogram and density plots
p1<-ggplot(data,aes(x=GarageQual_numeric))+geom_density()+
  labs(title = "Density of GarageQual_numeric",x="GarageQual_numeric",y="Values")

p2<-ggplot(data,aes(x=GarageQual_numeric))+geom_histogram()+
  labs(title = "Histogram of GarageQual_numeric",x="GarageQual_numeric",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

2.10.2 Outliers

# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = GarageQual_numeric)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
  coord_flip() +
  labs(title = "Boxplot of GarageQual_numeric", x = "Values", y = "")

ggarrange(p1, nrow=1, common.legend = FALSE)

# Count outliers
outliers <- boxplot.stats(data$GarageQual_numeric)$out

# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 92
# Identify rows with outliers
outlier_rows <- which(data$GarageQual_numeric %in% outliers)

# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 8 26 35 68 75 81 82 90 103 116 121 123 124 128 129 136 148 149 153 159 171 196 216 231 244 249 262 310 339 343 345 357 363 388 394 403 414 442 456 459 484 502 514 556 580 588 595 618 622 641 657 667 668 721 729 756 760 765 766 769 788 793 794 805 814 861 885 889 908 911 920 921 951 962 963 975 998 1000 1023 1033 1035 1036 1037 1041 1049 1074 1090 1098 1118 1127 1131 1132

There are 92 outliers; since there are enough data records, we will eliminate them.

# Remove rows with outliers
data <- data[-outlier_rows, ]

# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1046
print(length(unique(data$GarageQual_numeric)))
## [1] 1

Since removing the outliers leaves a degenerate variable (a single constant value), we will remove it from the data, as it will not provide any information for classifying the sale price.

# Drop the GarageQual_numeric variable
data <- select(data, -GarageQual_numeric)
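
More generally, degenerate (zero- or near-zero-variance) columns can be detected programmatically; a minimal sketch using caret's nearZeroVar function, shown here only as an illustration of the idea:

# Illustrative: names of near-zero-variance columns among the numeric variables
num_cols <- data[sapply(data, is.numeric)]
nzv_idx <- nearZeroVar(num_cols)
names(num_cols)[nzv_idx]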

2.11 GrLivArea

Above grade (ground) living area square feet

2.11.1 Numeric descriptive analysis

# Classical numeric descriptive analysis
describe(GrLivArea)
# Histogram and density plots
p1<-ggplot(data,aes(x=GrLivArea))+geom_density()+
  labs(title = "Density of GrLivArea",x="GrLivArea",y="Values")

p2<-ggplot(data,aes(x=GrLivArea))+geom_histogram()+
  labs(title = "Histogram of GrLivArea",x="GrLivArea",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

2.11.2 Missing values

# Missing Values

na_counts <- sum(is.na(data$GrLivArea))

print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0

0 missing values

2.11.3 Outliers

# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = GrLivArea)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
  coord_flip() +
  labs(title = "Boxplot of GrLivArea", x = "Values", y = "")

ggarrange(p1, nrow=1, common.legend = FALSE)

# Count outliers
outliers <- boxplot.stats(data$GrLivArea)$out

# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 17
# Identify rows with outliers
outlier_rows <- which(data$GrLivArea %in% outliers)

# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 47 89 230 231 355 377 439 464 582 587 704 743 757 830 848 971 994

There are 17 outliers; since there are enough data records, we will eliminate them.

# Remove rows with outliers
data <- data[-outlier_rows, ]

# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1029

2.11.4 Assumption of Normality

# Histogram representation of the column GrLivArea
par(mfcol = c(1, 1))

# Define the column name
j0 <- "GrLivArea"

# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)

# Create a histogram for the entire column
hist(data[, j0], prob = TRUE, col = grey(0.8), main = paste("Histogram of", j0), xlab = j0)

# Overlay normal distribution curve
lines(x0, dnorm(x0, mean(data[, j0], na.rm = TRUE), sd(data[, j0], na.rm = TRUE)), col = "blue", lwd = 2)

2.11.4.1 Qqplot graphic

# Representation of normal quantiles for the column GrLivArea
par(mfrow = c(1, 1))

# Define the column name
j0 <- "GrLivArea"

# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)

# Create a quantile-quantile plot
qqnorm(data[, j0], main = paste("Q-Q Plot of", j0), pch = 19, col = 2)
qqline(data[, j0])

This exploratory analysis can give us an idea of the possible normal distribution of the univariate variables, but it is always better to do the respective normality tests.

2.11.4.2 Univariate normality test (Shapiro-Wilk)

The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilk test is less than 0.05. Otherwise the assumption of normality of the data is not rejected.

data_tidy <- melt(data, value.name = "value")

# Use the aggregate function with the correct FUN argument
result <- aggregate(value ~ variable, data = data_tidy,
                    FUN = function(x) shapiro.test(x)$p.value)

# Print the row corresponding to GrLivArea
print(subset(result, variable == "GrLivArea"))
##    variable        value
## 7 GrLivArea 1.505932e-11

There is evidence of lack of univariate normality (p-value < 0.05).

2.12 X1stFlrSF

First Floor square feet

2.12.1 Numeric descriptive analysis

# Classical numeric descriptive analysis
describe(X1stFlrSF)
# Histogram and density plots
p1<-ggplot(data,aes(x=X1stFlrSF))+geom_density()+
  labs(title = "Density of X1stFlrSF",x="X1stFlrSF",y="Values")

p2<-ggplot(data,aes(x=X1stFlrSF))+geom_histogram()+
  labs(title = "Histogram of X1stFlrSF",x="X1stFlrSF",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

2.12.2 Missing values

# Missing Values

na_counts <- sum(is.na(data$X1stFlrSF))

print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0

0 missing values

2.12.3 Outliers

# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = X1stFlrSF)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
  coord_flip() +
  labs(title = "Boxplot of X1stFlrSF", x = "Values", y = "")

ggarrange(p1, nrow=1, common.legend = FALSE)

# Count outliers
outliers <- boxplot.stats(data$X1stFlrSF)$out

# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 9
# Identify rows with outliers
outlier_rows <- which(data$X1stFlrSF %in% outliers)

# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers: 57 127 161 310 642 650 743 853 970

There are 9 outliers; since there are enough data records, we will eliminate them.

# Remove rows with outliers
data <- data[-outlier_rows, ]

# Print a message after removing outliers
cat("Outliers removed. New number of rows:", nrow(data), "\n")
## Outliers removed. New number of rows: 1020

2.12.4 Assumption of Normality

# Histogram representation of the column X1stFlrSF
par(mfcol = c(1, 1))

# Define the column name
j0 <- "X1stFlrSF"

# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)

# Create a histogram for the entire column
hist(data[, j0], prob = TRUE, col = grey(0.8), main = paste("Histogram of", j0), xlab = j0)

# Overlay normal distribution curve
lines(x0, dnorm(x0, mean(data[, j0], na.rm = TRUE), sd(data[, j0], na.rm = TRUE)), col = "blue", lwd = 2)

2.12.4.1 Qqplot graphic

# Representation of normal quantiles for the column X1stFlrSF
par(mfrow = c(1, 1))

# Define the column name
j0 <- "X1stFlrSF"

# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)

# Create a quantile-quantile plot
qqnorm(data[, j0], main = paste("Q-Q Plot of", j0), pch = 19, col = 2)
qqline(data[, j0])

This exploratory analysis can give us an idea of the possible normal distribution of the univariate variables, but it is always better to do the respective normality tests.

2.12.4.2 Univariate normality test (Shapiro-Wilk)

The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilk test is less than 0.05. Otherwise the assumption of normality of the data is not rejected.

data_tidy <- melt(data, value.name = "value")

# Use the aggregate function with the correct FUN argument
result <- aggregate(value ~ variable, data = data_tidy,
                    FUN = function(x) shapiro.test(x)$p.value)

# Print the row corresponding to X1stFlrSF
print(subset(result, variable == "X1stFlrSF"))
##    variable        value
## 8 X1stFlrSF 2.088141e-15

There is evidence of lack of univariate normality (p-value < 0.05).

2.13 X2ndFlrSF

Second floor square feet

2.13.1 Numeric descriptive analysis

# Classical numeric descriptive analysis
describe(X2ndFlrSF)
# Histogram and density plots
p1<-ggplot(data,aes(x=X2ndFlrSF))+geom_density()+
  labs(title = "Density of X2ndFlrSF",x="X2ndFlrSF",y="Values")

p2<-ggplot(data,aes(x=X2ndFlrSF))+geom_histogram()+
  labs(title = "Histogram of X2ndFlrSF",x="X2ndFlrSF",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

2.13.2 Missing values

# Missing Values

na_counts <- sum(is.na(data$X2ndFlrSF))

print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0

0 missing values

2.13.3 Outliers

# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = X2ndFlrSF)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
  coord_flip() +
  labs(title = "Boxplot of X2ndFlrSF", x = "Values", y = "")

ggarrange(p1, nrow=1, common.legend = FALSE)

# Count outliers
outliers <- boxplot.stats(data$X2ndFlrSF)$out

# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 0
# Identify rows with outliers
outlier_rows <- which(data$X2ndFlrSF %in% outliers)

# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers:

There are 0 outliers.

2.13.4 Assumption of Normality

# Histogram representation of the column X2ndFlrSF
par(mfcol = c(1, 1))

# Define the column name
j0 <- "X2ndFlrSF"

# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)

# Create a histogram for the entire column
hist(data[, j0], prob = TRUE, col = grey(0.8), main = paste("Histogram of", j0), xlab = j0)

# Overlay normal distribution curve
lines(x0, dnorm(x0, mean(data[, j0], na.rm = TRUE), sd(data[, j0], na.rm = TRUE)), col = "blue", lwd = 2)

2.13.4.1 Qqplot graphic

# Representation of normal quantiles for the column X2ndFlrSF
par(mfrow = c(1, 1))

# Define the column name
j0 <- "X2ndFlrSF"

# Set the sequence for the x-axis
x0 <- seq(min(data[, j0], na.rm = TRUE), max(data[, j0], na.rm = TRUE), length.out = 50)

# Create a quantile-quantile plot
qqnorm(data[, j0], main = paste("Q-Q Plot of", j0), pch = 19, col = 2)
qqline(data[, j0])

This exploratory analysis can give us an idea of the possible normal distribution of the univariate variables, but it is always better to do the respective normality tests.

2.13.4.2 Univariate normality test (Shapiro-Wilk)

The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilk test is less than 0.05. Otherwise the assumption of normality of the data is not rejected.

# Melt the data
data_tidy <- melt(data, value.name = "value")

result <- aggregate(value ~ variable, data = data_tidy,
                    FUN = function(x) shapiro.test(x)$p.value)

# Print the row corresponding to X2ndFlrSF
print(subset(result, variable == "X2ndFlrSF"))
##    variable        value
## 9 X2ndFlrSF 5.323629e-38
print(result)
##       variable        value
## 1   MSSubClass 7.504974e-32
## 2  LotFrontage 9.076679e-12
## 3      LotArea 2.162218e-04
## 4  OverallQual 1.520600e-19
## 5  OverallCond 7.358633e-38
## 6    YearBuilt 1.316966e-24
## 7    GrLivArea 2.963722e-11
## 8    X1stFlrSF 2.088141e-15
## 9    X2ndFlrSF 5.323629e-38
## 10      YrSold 1.243031e-25
## 11   SalePrice 5.007286e-22

There is evidence of lack of univariate normality (p-value < 0.05).

2.14 YrSold

Year Sold (YYYY)

2.14.1 Numeric descriptive analysis

# Classical numeric descriptive analysis
describe(YrSold)
# Histogram and density plots
p1<-ggplot(data,aes(x=YrSold))+geom_density()+
  labs(title = "Density of YrSold",x="YrSold",y="Values")

p2<-ggplot(data,aes(x=YrSold))+geom_histogram()+
  labs(title = "Histogram of YrSold",x="YrSold",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

2.14.2 Missing values

# Missing Values

na_counts <- sum(is.na(data$YrSold))

print(na_counts)
## [1] 0
print(na_counts/nrow(data))
## [1] 0

0 missing values

2.14.3 Outliers

# Boxplot to visualize outliers
p1 <- ggplot(data, aes(x = YrSold)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, outlier.size = 2) +
  coord_flip() +
  labs(title = "Boxplot of YrSold", x = "Values", y = "")

ggarrange(p1, nrow=1, common.legend = FALSE)

# Count outliers
outliers <- boxplot.stats(data$YrSold)$out

# Display the count of outliers
cat("Number of outliers:", length(outliers), "\n")
## Number of outliers: 0
# Identify rows with outliers
outlier_rows <- which(data$YrSold %in% outliers)

# Print the indices of rows with outliers
cat("Indices of rows with outliers:", outlier_rows, "\n")
## Indices of rows with outliers:

There are 0 outliers.

2.14.4 Assumption of Normality

It isn’t a continuous variable so we skip this step.

2.15 SaleType

Type of sale

2.15.1 Numeric descriptive analysis

# Basic descriptive statistics
freq(SaleType)
## Frequencies  
## SaleType  
## Type: Character  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##         COD     43      2.95           2.95      2.95           2.95
##         Con      2      0.14           3.08      0.14           3.08
##       ConLD      9      0.62           3.70      0.62           3.70
##       ConLI      5      0.34           4.04      0.34           4.04
##       ConLw      5      0.34           4.38      0.34           4.38
##         CWD      4      0.27           4.66      0.27           4.66
##         New    122      8.36          13.01      8.36          13.01
##         Oth      3      0.21          13.22      0.21          13.22
##          WD   1267     86.78         100.00     86.78         100.00
##        <NA>      0                               0.00         100.00
##       Total   1460    100.00         100.00    100.00         100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = SaleType))+geom_bar()+
  coord_polar("y")+labs(x="SaleType",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = SaleType))+geom_bar()+
  labs(x="SaleType",y="%")

# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)

There aren’t any missing values, so we will continue with the next variable.

2.16 SaleCondition

Condition of sale

2.16.1 Numeric descriptive analysis

# Basic descriptive statistics
freq(SaleCondition)
## Frequencies  
## SaleCondition  
## Type: Character  
## 
##                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------- ------ --------- -------------- --------- --------------
##       Abnorml    101      6.92           6.92      6.92           6.92
##       AdjLand      4      0.27           7.19      0.27           7.19
##        Alloca     12      0.82           8.01      0.82           8.01
##        Family     20      1.37           9.38      1.37           9.38
##        Normal   1198     82.05          91.44     82.05          91.44
##       Partial    125      8.56         100.00      8.56         100.00
##          <NA>      0                               0.00         100.00
##         Total   1460    100.00         100.00    100.00         100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = SaleCondition))+geom_bar()+
  coord_polar("y")+labs(x="SaleCondition",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = SaleCondition))+geom_bar()+
  labs(x="SaleCondition",y="%")

# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)

There aren’t any missing values, so we will continue with the next variable.

2.17 SalePrice

Sale price of the property

2.17.1 Numeric descriptive analysis

# Classical numeric descriptive analysis
describe(SalePrice)
# Histogram and density plots
p1<-ggplot(data,aes(x=SalePrice))+geom_density()+
  labs(title = "Density of SalePrice",x="SalePrice",y="Values")

p2<-ggplot(data,aes(x=SalePrice))+geom_histogram()+
  labs(title = "Histogram of SalePrice",x="SalePrice",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)

We can perform a log transformation to correct the skewness of the variable

# Classical numeric descriptive analysis
describe(log(SalePrice))
# Histogram and density plots
p1<-ggplot(data,aes(x=log(SalePrice)))+geom_density()+
  labs(title = "Density of log(SalePrice)",x="log(SalePrice)",y="Values")

p2<-ggplot(data,aes(x=log(SalePrice)))+geom_histogram()+
  labs(title = "Histogram of log(SalePrice)",x="log(SalePrice)",y="Values")


# This function controls the graphical output device
ggarrange(p1,p2, nrow=1, common.legend = FALSE)
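
The effect of the log transformation can also be quantified numerically; a minimal sketch using the skewness function from the moments package loaded above:

# Skewness of SalePrice before and after the log transformation
skewness(data$SalePrice)
skewness(log(data$SalePrice))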

We will split this target variable into two categories (intervals) of equal frequency

# Split the target variable into two categories of equal frequency
data$PriceCategory <- ntile(data$SalePrice, 2)

# Name each category
data$PriceCategory <- ifelse(data$PriceCategory == 1, "Cheap", "Expensive")

# Numeric descriptive analysis and graphics
describe(data$SalePrice)
p1 <- ggplot(data, aes(x = SalePrice)) + 
      geom_density() + 
      labs(title = "Density of SalePrice", x = "SalePrice", y = "Density")

p2 <- ggplot(data, aes(x = SalePrice)) + 
      geom_histogram(bins = 30) + 
      labs(title = "Histogram of SalePrice", x = "SalePrice", y = "Count")

ggarrange(p1, p2, nrow = 1, common.legend = FALSE)

# Verify the new variable
table(data$PriceCategory)
## 
##     Cheap Expensive 
##       510       510
# Drop the SalePrice variable
data <- select(data, -SalePrice)

# To verify that the column has been removed, you can view the structure of the dataframe
str(data)
## 'data.frame':    1020 obs. of  16 variables:
##  $ MSSubClass   : int  60 60 70 60 50 20 60 20 60 20 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : num  65 68 60 84 85 ...
##  $ LotArea      : int  8450 11250 9550 14260 14115 10084 10382 11200 11924 12968 ...
##  $ OverallQual  : int  7 7 7 8 5 8 7 5 9 5 ...
##  $ OverallCond  : int  5 5 5 5 5 5 6 5 5 6 ...
##  $ YearBuilt    : int  2003 2001 1915 2000 1993 2004 1973 1965 2005 1962 ...
##  $ ExterQual    : chr  "Gd" "Gd" "TA" "Gd" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ GrLivArea    : int  1710 1786 1717 2198 1362 1694 2090 1040 2324 912 ...
##  $ X1stFlrSF    : int  856 920 961 1145 796 1694 1107 1040 1182 912 ...
##  $ X2ndFlrSF    : int  854 866 756 1053 566 0 983 0 1142 0 ...
##  $ YrSold       : int  2008 2008 2006 2008 2009 2007 2009 2008 2006 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Abnorml" "Normal" ...
##  $ PriceCategory: chr  "Expensive" "Expensive" "Cheap" "Expensive" ...
# Basic descriptive statistics
freq(data$PriceCategory)
## Frequencies  
## data$PriceCategory  
## Type: Character  
## 
##                   Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------- ------ --------- -------------- --------- --------------
##           Cheap    510     50.00          50.00     50.00          50.00
##       Expensive    510     50.00         100.00     50.00         100.00
##            <NA>      0                               0.00         100.00
##           Total   1020    100.00         100.00    100.00         100.00
# Pie chart and bar graph
p1<-ggplot(data, aes(x=factor(1), fill = PriceCategory))+geom_bar()+
  coord_polar("y")+labs(x="PriceCategory",y="%")
p2<-ggplot(data, aes(x=factor(1), fill = PriceCategory))+geom_bar()+
  labs(x="PriceCategory",y="%")

# This function controls the graphical output device
ggarrange(p1,p2,nrow = 1,ncol=2, common.legend = TRUE)

There aren’t any missing values.
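
A quick per-column count of missing values confirms this (a sketch):

# Sketch: number of missing values in each column
colSums(is.na(data))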

We have finished the univariate exploratory analysis and continue with the multivariate exploratory analysis.

3 Multivariate exploratory analysis

First, we check whether the variables are correlated using Bartlett's sphericity test.

3.1 Correlated variables

According to the numerical results below, it is observed that the data are correlated both at the sample level (see correlation matrix) and at the population level (Bartlett’s sphericity test is significant).

###############################
# Correlation at sample level #
###############################

# Select only numeric columns
numeric_data <- data[sapply(data, is.numeric)]

# Are the variables correlated at sample level?
correlation_matrix <- cor(numeric_data)

# Display the correlation matrix
correlation_matrix
##              MSSubClass LotFrontage     LotArea  OverallQual OverallCond
## MSSubClass   1.00000000 -0.28850505 -0.34633395  0.168087704 -0.15356723
## LotFrontage -0.28850505  1.00000000  0.49103272  0.235083104 -0.06579004
## LotArea     -0.34633395  0.49103272  1.00000000  0.190320939 -0.02338120
## OverallQual  0.16808770  0.23508310  0.19032094  1.000000000 -0.28583429
## OverallCond -0.15356723 -0.06579004 -0.02338120 -0.285834289  1.00000000
## YearBuilt    0.14716475  0.19948542  0.11863879  0.661596262 -0.46473667
## GrLivArea    0.17246027  0.33840880  0.38472039  0.582411185 -0.22098362
## X1stFlrSF   -0.12798840  0.25156502  0.26440761  0.448576362 -0.20670989
## X2ndFlrSF    0.27127630  0.15207657  0.18506757  0.247062916 -0.06785397
## YrSold      -0.01286138  0.00965220 -0.04087105  0.002003449  0.03533318
##               YearBuilt   GrLivArea    X1stFlrSF    X2ndFlrSF       YrSold
## MSSubClass   0.14716475  0.17246027 -0.127988399  0.271276296 -0.012861381
## LotFrontage  0.19948542  0.33840880  0.251565016  0.152076566  0.009652200
## LotArea      0.11863879  0.38472039  0.264407609  0.185067574 -0.040871046
## OverallQual  0.66159626  0.58241119  0.448576362  0.247062916  0.002003449
## OverallCond -0.46473667 -0.22098362 -0.206709895 -0.067853974  0.035333184
## YearBuilt    1.00000000  0.32606440  0.336130147  0.083630273  0.004240250
## GrLivArea    0.32606440  1.00000000  0.388102677  0.704222309 -0.013740015
## X1stFlrSF    0.33613015  0.38810268  1.000000000 -0.376068340 -0.005201827
## X2ndFlrSF    0.08363027  0.70422231 -0.376068340  1.000000000 -0.009587892
## YrSold       0.00424025 -0.01374002 -0.005201827 -0.009587892  1.000000000
# Compute the determinant of the full correlation matrix
det(correlation_matrix)
## [1] 0.0006431478
###################################
# Correlation at population level #
###################################

# Bartlett's sphericity test:
# This test checks whether the correlations are significantly different from 0
# The null hypothesis H_0 is that the correlation matrix R is the identity
# (det(R) = 1), i.e. the variables are uncorrelated
# The cortest.bartlett function in the psych package performs this test
# This function works with standardized data.

# Standardization
numeric_data_scale<-scale(numeric_data)

# Bartlett's sphericity test
cortest.bartlett(cor(numeric_data_scale))
## $chisq
## [1] 696.9431
## 
## $p.value
## [1] 9.28575e-118
## 
## $df
## [1] 45

Different graphical outputs are shown below that provide an intuitive idea of the correlation between the variables.

# Correlation matrix computed with hetcor (Pearson correlations for numeric variables)
poly_cor<-hetcor(numeric_data)$correlations

ggcorrplot(poly_cor, type="lower",hc.order=T)

# Another interesting visual representation is the following
corrplot(cor(numeric_data), order = "hclust", tl.col='black', tl.cex=1)

3.2 Absence of outliers and missing values

Done in previous sections.

3.3 Principal Component Analysis

3.3.1 Obtaining the principal components

# Parameters 'scale' and 'center' are set to TRUE to consider standardized data
PCA<-prcomp(numeric_data, scale=T, center = T)

# The field 'rotation' of the 'PCA' object is a matrix 
# Its columns are the coefficients of the principal components
# Indicates the weight of each variable in the corresponding principal component
PCA$rotation
##                     PC1          PC2         PC3          PC4         PC5
## MSSubClass  -0.04939892 -0.612183913  0.08492296 -0.001724381 -0.23614639
## LotFrontage -0.29832446  0.377582778 -0.23284795  0.033120106  0.22738320
## LotArea     -0.28555488  0.410844194 -0.31175988 -0.063712254  0.08520617
## OverallQual -0.46510644 -0.113538358  0.17329696  0.055240588 -0.23347625
## OverallCond  0.26395972  0.172250807 -0.32896903  0.109053081 -0.69413470
## YearBuilt   -0.39782368 -0.114372813  0.34625579  0.022633337  0.22673253
## GrLivArea   -0.47599950 -0.131942421 -0.27965660  0.019033090 -0.26227163
## X1stFlrSF   -0.31117519  0.315849190  0.40163925  0.012687484 -0.45923559
## X2ndFlrSF   -0.24327394 -0.375543201 -0.58613095  0.009858948  0.09705571
## YrSold       0.01305882  0.004949914  0.01260068  0.989325042  0.09179512
##                      PC6         PC7         PC8          PC9          PC10
## MSSubClass  -0.109772774 -0.50699006  0.53704011  0.048050668  8.229278e-04
## LotFrontage  0.268123202 -0.75559624 -0.12374617  0.057269306 -2.630894e-03
## LotArea     -0.188615340  0.25960338  0.72529456  0.110136422  6.402832e-05
## OverallQual  0.364478031  0.20452667 -0.09771754  0.706374719  1.082659e-03
## OverallCond  0.495443825  0.02190150  0.10787432 -0.202821254 -1.074622e-03
## YearBuilt    0.508924782  0.16788156  0.19217872 -0.576707138 -1.139130e-02
## GrLivArea   -0.305290803  0.04782402 -0.25957016 -0.245471309 -6.209161e-01
## X1stFlrSF   -0.374829381 -0.11352645 -0.10116179 -0.193892783  4.793991e-01
## X2ndFlrSF   -0.005033558  0.13586244 -0.17913630 -0.107910847  6.200807e-01
## YrSold      -0.097190601  0.02263710  0.04903243  0.009652686  3.040544e-05
# Standard deviations of each principal component
PCA$sdev
##  [1] 1.74325300 1.33490571 1.24615986 1.00332182 0.90133109 0.77844570
##  [7] 0.72139346 0.66261791 0.48872537 0.05317622

Each principal component is obtained in a simple way as a linear combination of all the variables with the coefficients indicated by the columns of the rotation matrix.
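
This can be verified numerically: multiplying the standardized data by the rotation matrix reproduces the scores stored in PCA$x (a sketch):

# Sketch: the scores are linear combinations of the standardized variables
scores_manual <- scale(numeric_data) %*% PCA$rotation
max(abs(scores_manual - PCA$x))  # should be numerically zero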

3.3.2 Explained variance rate

# The function 'summary' applied to the 'PCA' object provides relevant information
# - Standard deviations of each principal component
# - Proportion of variance explained and cumulative variance
summary(PCA)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5    PC6     PC7
## Standard deviation     1.7433 1.3349 1.2462 1.0033 0.90133 0.7784 0.72139
## Proportion of Variance 0.3039 0.1782 0.1553 0.1007 0.08124 0.0606 0.05204
## Cumulative Proportion  0.3039 0.4821 0.6374 0.7380 0.81929 0.8799 0.93193
##                            PC8     PC9    PC10
## Standard deviation     0.66262 0.48873 0.05318
## Proportion of Variance 0.04391 0.02389 0.00028
## Cumulative Proportion  0.97583 0.99972 1.00000
# The following graph shows the proportion of explained variance
Explained_variance <- PCA$sdev^2 / sum(PCA$sdev^2)

p1<-ggplot(data = data.frame(Explained_variance, pc = 1:10),
  aes(x = pc, y = Explained_variance, fill=Explained_variance )) +
  geom_col(width = 0.3) + scale_y_continuous(limits = c(0,0.6)) + theme_bw() +
  labs(x = "Principal component", y= "Proportion of variance")

# The following graph shows the proportion of cumulative explained variance
Cumulative_variance<-cumsum(Explained_variance)

p2<-ggplot( data = data.frame(Cumulative_variance, pc = 1:10),
  aes(x = pc, y = Cumulative_variance, fill=Cumulative_variance )) +
  geom_col(width = 0.5) +  scale_y_continuous(limits = c(0,1)) + theme_bw() +
  labs(x = "Principal component",
       y = "Proportion of cumulative variance")

p1

p2

fviz_eig(PCA)

3.3.3 Appropriate number of principal components

Applying the rule of Abdi et al. (retain the components whose variance exceeds the average variance, which equals 1 for standardized data), only four principal components are kept, as can be deduced from the following code chunk.

#######################
# Rule of Abdi et al. #
#######################

# Variances
PCA$sdev^2
##  [1] 3.03893101 1.78197326 1.55291439 1.00665468 0.81239774 0.60597771
##  [7] 0.52040853 0.43906249 0.23885248 0.00282771
# Average of variances
mean(PCA$sdev^2)
## [1] 1
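
The selection implied by this rule can be made explicit with a one-line check (a sketch):

# Sketch: indices of the components whose variance exceeds the average
which(PCA$sdev^2 > mean(PCA$sdev^2))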

3.3.4 PCA graphical outputs of interest

3.3.4.1 Distances

# These graphical outputs show the projection of the variables in two dimensions
# Display the weight of the variable in the direction of the principal component 

# Projection of variables on the first and second principal components
p1 <- fviz_pca_var(PCA, axes = c(1, 2), repel = TRUE, col.var = "cos2",
                   legend.title = "Distance", title = "Variables (PC1 vs PC2)") + theme_bw()

# Projection on the first and third principal components
p2 <- fviz_pca_var(PCA, axes = c(1, 3), repel = TRUE, col.var = "cos2",
                   legend.title = "Distance", title = "Variables (PC1 vs PC3)") + theme_bw()

# Projection on the first and fourth principal components
p3 <- fviz_pca_var(PCA, axes = c(1, 4), repel = TRUE, col.var = "cos2",
                   legend.title = "Distance", title = "Variables (PC1 vs PC4)") + theme_bw()

# Projection on the second and third principal components
p4 <- fviz_pca_var(PCA, axes = c(2, 3), repel = TRUE, col.var = "cos2",
                   legend.title = "Distance", title = "Variables (PC2 vs PC3)") + theme_bw()

# Projection on the second and fourth principal components
p5 <- fviz_pca_var(PCA, axes = c(2, 4), repel = TRUE, col.var = "cos2",
                   legend.title = "Distance", title = "Variables (PC2 vs PC4)") + theme_bw()

# Projection on the third and fourth principal components
p6 <- fviz_pca_var(PCA, axes = c(3, 4), repel = TRUE, col.var = "cos2",
                   legend.title = "Distance", title = "Variables (PC3 vs PC4)") + theme_bw()

# Displaying graphics
p1

p2

p3

p4

p5

p6

3.3.4.2 Observations and variance contribution

# It is also possible to represent the observations
# As well as identify with colors those observations that explain the greatest 
# variance of the principal components
p1<-fviz_pca_ind(PCA,axes = c(1,2),col.ind = "contrib", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()


p2<-fviz_pca_ind(PCA,axes=c(1,3),col.ind = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()

p3<-fviz_pca_ind(PCA,axes=c(1,4),col.ind = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()

p4<-fviz_pca_ind(PCA,axes=c(2,3),col.ind = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()

p5<-fviz_pca_ind(PCA,axes=c(2,4),col.ind = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()

p6<-fviz_pca_ind(PCA,axes=c(3,4),col.ind = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel=TRUE,legend.title="Contrib.var", title="Records")+theme_bw()

# Displaying graphics
p1

p2

p3

p4

p5

p6

Since there are so many observations, the previous outputs are not very clear.

3.3.4.3 Observations and variables with variance contribution

# Joint representation of variables and observations
# Relates the possible relationships between the contributions of the records
# to the variances of the components and the weight of the variables in each 
# principal component

p1<-fviz_pca(PCA,axes=c(1,2),alpha.ind ="contrib", col.var = "cos2",
         col.ind="seagreen",
         gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
         repel=TRUE, legend.title="Distancia")+theme_bw()

p2<-fviz_pca(PCA,axes=c(1,3),alpha.ind ="contrib", 
         col.var = "cos2",col.ind="seagreen",
         gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
         repel=TRUE, legend.title="Distancia")+theme_bw()

p3<-fviz_pca(PCA,axes=c(1,4),alpha.ind ="contrib", 
         col.var = "cos2",col.ind="seagreen",
         gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
         repel=TRUE, legend.title="Distancia")+theme_bw()

p4<-fviz_pca(PCA,axes=c(2,3),alpha.ind ="contrib", 
         col.var = "cos2",col.ind="seagreen",
         gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
         repel=TRUE, legend.title="Distancia")+theme_bw()

p5<-fviz_pca(PCA,axes=c(2,4),alpha.ind ="contrib", 
         col.var = "cos2",col.ind="seagreen",
         gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
         repel=TRUE, legend.title="Distancia")+theme_bw()

p6<-fviz_pca(PCA,axes=c(3,4),alpha.ind ="contrib", 
         col.var = "cos2",col.ind="seagreen",
         gradient.cols = c("#FDF50E", "#FD960E", "#FD1E0E"),
         repel=TRUE, legend.title="Distancia")+theme_bw()

# Displaying graphics
p1

p2

p3

p4

p5

p6

3.3.4.4 Coordinates in the new reference system

Finally, since the object of this part was to reduce the dimension of the data set, it is possible to obtain the coordinates of the original data in the new reference system.

head(PCA$x)
##          PC1        PC2        PC3        PC4        PC5         PC6        PC7
## 1 -0.8452027 -1.6738113 -0.4179040  0.1597390  0.7328336  0.53822810  0.6112833
## 3 -1.3295188 -1.1303234 -0.8048391  0.1047746  0.7172377  0.23081222  0.7016935
## 4  0.3536070 -1.2804509 -1.3250047 -1.4367515 -0.4417883 -1.30933047  0.1670796
## 5 -3.1108652 -0.4084764 -1.5479228  0.1434477  0.3476182  0.01827479  0.3296982
## 6 -0.3612626  0.4427567 -1.2539408  0.6975673  1.8817162  0.05321400 -0.1994333
## 7 -1.8359116  1.1176454  1.4965455 -0.5418138 -0.3665633  0.17510923  0.4713932
##          PC8         PC9          PC10
## 1 -0.2628569 -0.13551171 -3.170197e-03
## 3  0.3812697 -0.05142864 -2.022434e-03
## 4 -0.4464533  1.73267727  3.571344e-02
## 5  0.5714314  0.27611780  8.193545e-05
## 6  1.3914121 -0.38731959 -6.923908e-03
## 7 -0.5925704  0.14855231 -2.013263e-03
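
For example, a reduced data set could keep only the four retained components together with the class label created earlier (a sketch; data_reducida is only an illustrative name not used elsewhere in this study):

# Sketch: reduced data set with the four retained principal components
data_reducida <- data.frame(PCA$x[, 1:4], PriceCategory = data$PriceCategory)
head(data_reducida)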

3.4 Factor Analysis

3.4.1 Appropriate number of latent factors

There are different criteria, among which the scree plot (Cattell 1966) and parallel analysis (Horn 1965) stand out. According to the following graphical outputs, 3 is considered to be the optimal number of factors (parallel analysis).

# Scree plot
scree(poly_cor)

#Parallel analysis
fa.parallel(poly_cor,n.obs=100,fa="fa",fm="ml")

## Parallel analysis suggests that the number of factors =  3  and the number of components =  NA

A factor model with 3 factors is fitted, applying a varimax rotation to obtain a simpler interpretation.

# Note: psych::fa's factoring-method argument is 'fm'; the 'fa = "mle"' below is
# ignored, so the default minres method is used (hence the MR labels in the loadings)
modelo_varimax<-fa(poly_cor,nfactors = 3,rotate = "varimax",
                   fa="mle")

# The rotated factor loading matrix is shown
print(modelo_varimax$loadings,cut=0) 
## 
## Loadings:
##             MR1    MR2    MR3   
## MSSubClass   0.221  0.245 -0.581
## LotFrontage  0.159  0.196  0.589
## LotArea      0.078  0.243  0.713
## OverallQual  0.773  0.284  0.104
## OverallCond -0.458 -0.072  0.066
## YearBuilt    0.739  0.108  0.016
## GrLivArea    0.470  0.699  0.266
## X1stFlrSF    0.599 -0.214  0.401
## X2ndFlrSF   -0.059  1.019 -0.086
## YrSold      -0.006 -0.014 -0.009
## 
##                  MR1   MR2   MR3
## SS loadings    2.016 1.828 1.447
## Proportion Var 0.202 0.183 0.145
## Cumulative Var 0.202 0.384 0.529

We could visually inspect which variables correlate most strongly with each factor, but this is tedious, so we use the following diagram representation instead.

fa.diagram(modelo_varimax)

Another way to perform the previous analysis is with the factanal function.

# This function only implements the maximum likelihood estimation method
FA<-factanal(numeric_data,factors=3, rotation="varimax")
FA$loadings
## 
## Loadings:
##             Factor1 Factor2 Factor3
## MSSubClass   0.125           0.282 
## LotFrontage  0.202   0.310         
## LotArea      0.122   0.389         
## OverallQual  0.664   0.400         
## OverallCond -0.465                 
## YearBuilt    0.994                 
## GrLivArea    0.309   0.888   0.333 
## X1stFlrSF    0.404   0.565  -0.717 
## X2ndFlrSF            0.453   0.889 
## YrSold                             
## 
##                Factor1 Factor2 Factor3
## SS loadings      1.976   1.729   1.507
## Proportion Var   0.198   0.173   0.151
## Cumulative Var   0.198   0.370   0.521

3.5 Multivariate normality

# Royston multivariate normality test
royston_test <- MVN::mvn(data = numeric_data, mvnTest = "royston", multivariatePlot = "qq")

royston_test$multivariateNormality
# Henze-Zirkler multivariate normality test
hz_test <- MVN::mvn(data = numeric_data, mvnTest = "hz")
hz_test$multivariateNormality

We find evidence of a lack of multivariate normality at the 5% level.
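
The same object also stores the univariate normality tests, which helps identify the variables that drive this deviation (a sketch, assuming the list structure returned by MVN::mvn):

# Sketch: univariate normality tests reported by the same object
royston_test$univariateNormality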

3.6 Classifier using linear discriminant analysis

We will work with a partition of the dataset: a training set (80% of the data), on which we perform the linear discriminant analysis, and a test set (20%), with which we validate the model.

# The response variable has to be an object of the class 'factor' of R language
data$PriceCategory<-as.factor(data$PriceCategory)


set.seed(3)

# Partitioning the dataset: training (80%) + test (20%)
trainIndex<-createDataPartition(data$PriceCategory,p=0.80)$Resample
datos_train<-data[trainIndex,]
datos_test<-data[-trainIndex,]

datos_train
datos_test

The lda function of the MASS package fits this model.

modelo_lda <- lda(formula = PriceCategory ~ LotFrontage + MSSubClass + MSZoning + ExterQual + ExterCond + SaleType + SaleCondition + LotArea + OverallQual + OverallCond + YearBuilt + GrLivArea + X1stFlrSF + X2ndFlrSF + YrSold ,data = datos_train)
modelo_lda
## Call:
## lda(PriceCategory ~ LotFrontage + MSSubClass + MSZoning + ExterQual + 
##     ExterCond + SaleType + SaleCondition + LotArea + OverallQual + 
##     OverallCond + YearBuilt + GrLivArea + X1stFlrSF + X2ndFlrSF + 
##     YrSold, data = datos_train)
## 
## Prior probabilities of groups:
##     Cheap Expensive 
##       0.5       0.5 
## 
## Group means:
##           LotFrontage MSSubClass MSZoningFV MSZoningRH MSZoningRL MSZoningRM
## Cheap        67.79471   46.06618 0.00000000 0.01715686  0.7696078 0.20098039
## Expensive    73.35384   53.99510 0.08823529 0.00245098  0.8946078 0.01470588
##           ExterQualFa ExterQualGd ExterQualTA ExterCondGd ExterCondTA
## Cheap     0.009803922  0.07843137   0.9117647  0.10784314   0.8750000
## Expensive 0.000000000  0.68137255   0.2622549  0.04656863   0.9534314
##           SaleTypeCon SaleTypeConLD SaleTypeConLI SaleTypeConLw SaleTypeCWD
## Cheap     0.000000000   0.002450980   0.002450980    0.00245098 0.004901961
## Expensive 0.004901961   0.004901961   0.004901961    0.00245098 0.002450980
##           SaleTypeNew SaleTypeOth SaleTypeWD SaleConditionAdjLand
## Cheap      0.01960784  0.00245098  0.9142157           0.00245098
## Expensive  0.16421569  0.00000000  0.8039216           0.00000000
##           SaleConditionAlloca SaleConditionFamily SaleConditionNormal
## Cheap             0.009803922         0.022058824           0.8553922
## Expensive         0.002450980         0.004901961           0.7916667
##           SaleConditionPartial   LotArea OverallQual OverallCond YearBuilt
## Cheap               0.01960784  8880.846    5.276961    5.664216  1959.056
## Expensive           0.16666667 10205.684    7.132353    5.235294  1995.265
##           GrLivArea X1stFlrSF X2ndFlrSF   YrSold
## Cheap      1234.218  1054.777  172.2819 2007.782
## Expensive  1758.924  1298.103  459.8799 2007.811
## 
## Coefficients of linear discriminants:
##                                LD1
## LotFrontage          -1.342848e-02
## MSSubClass           -2.456770e-03
## MSZoningFV           -3.106813e-01
## MSZoningRH           -1.301194e+00
## MSZoningRL           -8.116036e-01
## MSZoningRM           -1.177599e+00
## ExterQualFa           6.290882e-01
## ExterQualGd           6.985497e-01
## ExterQualTA          -6.506843e-02
## ExterCondGd           1.856076e-01
## ExterCondTA           2.793930e-01
## SaleTypeCon           8.240608e-01
## SaleTypeConLD        -7.382923e-02
## SaleTypeConLI        -2.245549e-01
## SaleTypeConLw         4.856318e-01
## SaleTypeCWD          -7.588695e-01
## SaleTypeNew          -4.715859e-01
## SaleTypeOth           6.837812e-01
## SaleTypeWD            4.329793e-01
## SaleConditionAdjLand -3.115268e-02
## SaleConditionAlloca  -5.066976e-01
## SaleConditionFamily  -3.962297e-01
## SaleConditionNormal   7.713446e-02
## SaleConditionPartial  8.615625e-01
## LotArea               3.085357e-05
## OverallQual           3.331124e-01
## OverallCond           1.781541e-01
## YearBuilt             2.393942e-02
## GrLivArea             1.358962e-03
## X1stFlrSF             5.693947e-04
## X2ndFlrSF             3.646943e-04
## YrSold                8.389040e-03
nuevas_observaciones <- datos_test
predict(object = modelo_lda, newdata = nuevas_observaciones)
## $class
##   [1] Expensive Cheap     Cheap     Cheap     Cheap     Cheap     Expensive
##   [8] Cheap     Cheap     Cheap     Cheap     Cheap     Expensive Cheap    
##  [15] Expensive Cheap     Expensive Cheap     Cheap     Expensive Expensive
##  [22] Expensive Cheap     Cheap     Expensive Expensive Cheap     Cheap    
##  [29] Expensive Expensive Cheap     Expensive Cheap     Cheap     Expensive
##  [36] Expensive Expensive Expensive Cheap     Expensive Cheap     Cheap    
##  [43] Cheap     Cheap     Cheap     Cheap     Expensive Cheap     Expensive
##  [50] Expensive Expensive Cheap     Cheap     Cheap     Cheap     Expensive
##  [57] Expensive Cheap     Expensive Expensive Expensive Cheap     Expensive
##  [64] Expensive Cheap     Cheap     Cheap     Expensive Expensive Expensive
##  [71] Cheap     Expensive Cheap     Expensive Cheap     Expensive Cheap    
##  [78] Expensive Cheap     Cheap     Cheap     Cheap     Cheap     Expensive
##  [85] Cheap     Cheap     Cheap     Cheap     Expensive Cheap     Expensive
##  [92] Cheap     Cheap     Expensive Expensive Cheap     Cheap     Cheap    
##  [99] Expensive Cheap     Expensive Expensive Cheap     Cheap     Expensive
## [106] Cheap     Cheap     Cheap     Expensive Expensive Cheap     Expensive
## [113] Cheap     Expensive Expensive Cheap     Expensive Expensive Expensive
## [120] Cheap     Cheap     Expensive Expensive Cheap     Expensive Cheap    
## [127] Expensive Cheap     Cheap     Cheap     Expensive Expensive Expensive
## [134] Cheap     Expensive Cheap     Cheap     Cheap     Cheap     Expensive
## [141] Cheap     Expensive Expensive Cheap     Expensive Cheap     Expensive
## [148] Expensive Cheap     Expensive Cheap     Cheap     Expensive Expensive
## [155] Expensive Expensive Expensive Expensive Cheap     Cheap     Expensive
## [162] Cheap     Cheap     Cheap     Expensive Cheap     Expensive Cheap    
## [169] Expensive Expensive Expensive Cheap     Expensive Cheap     Cheap    
## [176] Cheap     Expensive Cheap     Cheap     Cheap     Cheap     Cheap    
## [183] Cheap     Cheap     Expensive Cheap     Expensive Expensive Expensive
## [190] Expensive Expensive Expensive Expensive Expensive Expensive Cheap    
## [197] Cheap     Expensive Cheap     Cheap     Cheap     Cheap     Expensive
## [204] Cheap    
## Levels: Cheap Expensive
## 
## $posterior
##             Cheap    Expensive
## 14   3.081527e-02 9.691847e-01
## 29   6.125823e-01 3.874177e-01
## 41   9.663121e-01 3.368793e-02
## 45   9.889462e-01 1.105377e-02
## 53   9.994850e-01 5.150359e-04
## 55   9.852142e-01 1.478584e-02
## 58   1.377039e-02 9.862296e-01
## 72   9.960880e-01 3.911951e-03
## 74   9.930967e-01 6.903331e-03
## 77   9.981125e-01 1.887512e-03
## 78   9.960688e-01 3.931155e-03
## 84   9.990521e-01 9.478909e-04
## 85   3.902191e-01 6.097809e-01
## 98   9.984298e-01 1.570150e-03
## 102  7.897845e-02 9.210215e-01
## 103  9.858893e-01 1.411074e-02
## 104  4.463689e-01 5.536311e-01
## 108  9.973503e-01 2.649738e-03
## 111  8.324230e-01 1.675770e-01
## 131  6.957034e-02 9.304297e-01
## 134  2.334591e-02 9.766541e-01
## 142  3.514416e-03 9.964856e-01
## 143  9.972669e-01 2.733086e-03
## 157  9.925769e-01 7.423125e-03
## 163  2.420339e-02 9.757966e-01
## 168  4.085761e-04 9.995914e-01
## 187  6.936963e-01 3.063037e-01
## 189  9.943953e-01 5.604674e-03
## 193  2.280750e-02 9.771925e-01
## 200  1.665426e-03 9.983346e-01
## 205  9.946169e-01 5.383127e-03
## 213  9.209495e-03 9.907905e-01
## 214  8.632973e-01 1.367027e-01
## 216  9.827641e-01 1.723593e-02
## 220  7.952515e-02 9.204748e-01
## 221  1.797721e-02 9.820228e-01
## 239  2.602022e-02 9.739798e-01
## 241  6.632742e-04 9.993367e-01
## 260  9.993340e-01 6.659559e-04
## 271  3.939904e-04 9.996060e-01
## 274  7.425348e-01 2.574652e-01
## 275  9.881600e-01 1.184004e-02
## 276  5.485902e-01 4.514098e-01
## 289  9.974449e-01 2.555118e-03
## 290  9.931615e-01 6.838546e-03
## 295  8.651106e-01 1.348894e-01
## 299  1.784830e-01 8.215170e-01
## 304  9.896898e-01 1.031022e-02
## 317  9.246537e-04 9.990753e-01
## 337  1.882270e-04 9.998118e-01
## 339  1.333471e-02 9.866653e-01
## 353  9.969227e-01 3.077311e-03
## 356  5.873161e-01 4.126839e-01
## 362  9.919363e-01 8.063747e-03
## 373  9.883371e-01 1.166288e-02
## 375  2.759836e-03 9.972402e-01
## 386  1.929004e-02 9.807100e-01
## 397  9.942107e-01 5.789256e-03
## 410  1.850165e-04 9.998150e-01
## 415  1.216942e-03 9.987831e-01
## 424  4.868152e-05 9.999513e-01
## 425  9.807072e-01 1.929276e-02
## 436  2.863908e-03 9.971361e-01
## 445  8.006139e-03 9.919939e-01
## 446  9.137502e-01 8.624979e-02
## 450  9.995545e-01 4.455108e-04
## 467  8.992516e-01 1.007484e-01
## 470  5.653525e-02 9.434647e-01
## 471  4.386637e-02 9.561336e-01
## 472  1.140252e-01 8.859748e-01
## 476  9.961019e-01 3.898109e-03
## 479  1.345439e-03 9.986546e-01
## 486  9.403224e-01 5.967758e-02
## 516  5.228055e-04 9.994772e-01
## 523  9.158069e-01 8.419308e-02
## 526  8.730629e-03 9.912694e-01
## 533  9.964766e-01 3.523356e-03
## 542  2.069119e-04 9.997931e-01
## 544  9.724435e-01 2.755654e-02
## 549  9.985283e-01 1.471736e-03
## 551  9.896630e-01 1.033697e-02
## 561  9.514591e-01 4.854092e-02
## 562  9.378913e-01 6.210867e-02
## 573  8.760233e-02 9.123977e-01
## 577  9.093343e-01 9.066575e-02
## 587  9.991866e-01 8.134493e-04
## 588  9.932753e-01 6.724738e-03
## 594  9.292477e-01 7.075226e-02
## 596  8.211967e-04 9.991788e-01
## 599  5.963614e-01 4.036386e-01
## 605  2.075367e-02 9.792463e-01
## 610  9.990498e-01 9.502455e-04
## 616  9.883942e-01 1.160581e-02
## 620  1.997877e-04 9.998002e-01
## 622  7.732523e-03 9.922675e-01
## 625  6.145925e-01 3.854075e-01
## 626  9.799370e-01 2.006303e-02
## 629  7.364559e-01 2.635441e-01
## 632  2.302037e-03 9.976980e-01
## 634  9.939419e-01 6.058087e-03
## 641  3.609356e-02 9.639064e-01
## 642  9.704052e-04 9.990296e-01
## 644  9.292809e-01 7.071914e-02
## 647  9.970956e-01 2.904402e-03
## 653  1.366774e-02 9.863323e-01
## 657  8.949552e-01 1.050448e-01
## 664  9.973047e-01 2.695273e-03
## 682  9.988072e-01 1.192777e-03
## 684  1.462187e-03 9.985378e-01
## 690  9.975925e-02 9.002407e-01
## 696  6.215029e-01 3.784971e-01
## 713  5.806549e-02 9.419345e-01
## 718  9.712271e-01 2.877291e-02
## 725  4.502434e-03 9.954976e-01
## 732  1.399487e-01 8.600513e-01
## 735  9.989741e-01 1.025876e-03
## 743  2.381036e-01 7.618964e-01
## 754  5.326188e-04 9.994674e-01
## 767  2.114580e-01 7.885420e-01
## 768  7.976654e-01 2.023346e-01
## 773  9.895303e-01 1.046968e-02
## 775  1.249986e-03 9.987500e-01
## 781  4.248697e-01 5.751303e-01
## 805  9.990533e-01 9.466539e-04
## 806  3.268942e-02 9.673106e-01
## 809  9.955434e-01 4.456597e-03
## 823  2.625268e-02 9.737473e-01
## 835  9.894502e-01 1.054979e-02
## 863  9.413821e-01 5.861788e-02
## 864  9.929764e-01 7.023600e-03
## 865  7.561672e-03 9.924383e-01
## 867  2.008621e-03 9.979914e-01
## 870  6.243752e-03 9.937562e-01
## 874  8.138007e-01 1.861993e-01
## 883  4.726083e-01 5.273917e-01
## 891  9.978743e-01 2.125712e-03
## 900  9.624537e-01 3.754631e-02
## 906  9.974376e-01 2.562356e-03
## 912  8.774902e-01 1.225098e-01
## 919  6.379959e-03 9.936200e-01
## 926  9.509625e-01 4.903745e-02
## 930  6.753586e-03 9.932464e-01
## 933  1.701486e-03 9.982985e-01
## 936  9.999687e-01 3.129702e-05
## 939  2.580893e-03 9.974191e-01
## 954  6.856779e-01 3.143221e-01
## 966  6.731369e-02 9.326863e-01
## 995  2.611728e-03 9.973883e-01
## 997  9.976103e-01 2.389705e-03
## 1003 2.516099e-03 9.974839e-01
## 1013 9.753896e-01 2.461043e-02
## 1015 9.883617e-01 1.163828e-02
## 1017 1.763431e-02 9.823657e-01
## 1018 3.505381e-01 6.494619e-01
## 1022 6.597022e-02 9.340298e-01
## 1024 1.340293e-02 9.865971e-01
## 1028 2.366180e-03 9.976338e-01
## 1044 8.904463e-03 9.910955e-01
## 1048 9.426095e-01 5.739050e-02
## 1053 5.834704e-01 4.165296e-01
## 1059 1.213542e-03 9.987865e-01
## 1076 9.896893e-01 1.031067e-02
## 1078 9.857162e-01 1.428383e-02
## 1080 9.599944e-01 4.000559e-02
## 1083 1.847632e-02 9.815237e-01
## 1086 9.222724e-01 7.772762e-02
## 1096 9.911931e-02 9.008807e-01
## 1099 9.991120e-01 8.880372e-04
## 1134 1.431311e-03 9.985687e-01
## 1153 4.897458e-01 5.102542e-01
## 1162 1.019704e-01 8.980296e-01
## 1163 9.992909e-01 7.090629e-04
## 1182 9.483142e-02 9.051686e-01
## 1186 9.977900e-01 2.210048e-03
## 1217 5.373391e-01 4.626609e-01
## 1223 7.298534e-01 2.701466e-01
## 1225 1.002582e-02 9.899742e-01
## 1232 9.898524e-01 1.014756e-02
## 1243 9.698644e-01 3.013557e-02
## 1250 9.943082e-01 5.691763e-03
## 1256 9.945413e-01 5.458703e-03
## 1272 5.983014e-01 4.016986e-01
## 1275 9.994212e-01 5.787550e-04
## 1277 8.183364e-01 1.816636e-01
## 1279 7.304271e-04 9.992696e-01
## 1287 9.311929e-01 6.880711e-02
## 1289 2.121189e-03 9.978788e-01
## 1301 2.568067e-03 9.974319e-01
## 1330 1.157762e-01 8.842238e-01
## 1339 8.142460e-03 9.918575e-01
## 1349 1.147212e-02 9.885279e-01
## 1361 3.442294e-01 6.557706e-01
## 1367 8.917968e-03 9.910820e-01
## 1370 6.846544e-04 9.993153e-01
## 1376 6.510773e-03 9.934892e-01
## 1378 9.724688e-01 2.753122e-02
## 1383 9.465195e-01 5.348054e-02
## 1391 1.461918e-02 9.853808e-01
## 1399 9.004292e-01 9.957081e-02
## 1413 9.993880e-01 6.119850e-04
## 1419 9.972793e-01 2.720656e-03
## 1422 9.719974e-01 2.800264e-02
## 1427 2.170844e-03 9.978292e-01
## 1428 9.521519e-01 4.784806e-02
## 
## $x
##              LD1
## 14    1.18531630
## 29   -0.15748787
## 41   -1.15366004
## 45   -1.54465436
## 53   -2.60225814
## 55   -1.44336574
## 58    1.46817570
## 72   -1.90416691
## 74   -1.70790844
## 77   -2.15536337
## 78   -1.90247700
## 84   -2.39243613
## 85    0.15343576
## 98   -2.21874838
## 102   0.84429429
## 103  -1.45966485
## 104   0.07402211
## 108  -2.03850899
## 111  -0.55095592
## 131   0.89138452
## 134   1.28336926
## 142   1.94113855
## 143  -2.02783485
## 157  -1.68277537
## 163   1.27066883
## 168   2.68188717
## 187  -0.28098032
## 189  -1.77999093
## 193   1.29157859
## 200   2.19846687
## 205  -1.79393045
## 213   1.60803714
## 214  -0.63346796
## 216  -1.38980796
## 220   0.84171909
## 221   1.37507472
## 239   1.24514907
## 241   2.51526132
## 260  -2.51387344
## 271   2.69438710
## 274  -0.36406828
## 275  -1.52076182
## 276  -0.06701815
## 289  -2.05104025
## 290  -1.71117179
## 295  -0.63877919
## 299   0.52475080
## 304  -1.56884833
## 317   2.40097541
## 337   2.94835984
## 339   1.47937835
## 353  -1.98694195
## 356  -0.12129403
## 362  -1.65410054
## 373  -1.52600526
## 375   2.02447781
## 386   1.35038768
## 397  -1.76878945
## 410   2.95427434
## 415   2.30646138
## 424   3.41324387
## 425  -1.35033826
## 436   2.01171861
## 445   1.65658491
## 446  -0.81129712
## 450  -2.65212735
## 467  -0.75239213
## 470   0.96748057
## 471   1.05927403
## 472   0.70472839
## 476  -1.90539009
## 479   2.27191430
## 486  -0.94774108
## 516   2.59710888
## 523  -0.82036569
## 526   1.62655729
## 533  -1.94026219
## 542   2.91582181
## 544  -1.22488839
## 549  -2.24103113
## 551  -1.56794852
## 561  -1.02278412
## 562  -0.93312661
## 573   0.80543961
## 577  -0.79246909
## 587  -2.44505708
## 588  -1.71697961
## 594  -0.88515719
## 596   2.44179622
## 599  -0.13416503
## 605   1.32473630
## 610  -2.39158255
## 616  -1.52771117
## 620   2.92786756
## 622   1.66863218
## 625  -0.16040211
## 626  -1.33661179
## 629  -0.35322154
## 632   2.08697934
## 634  -1.75309476
## 641   1.12909492
## 642   2.38435967
## 644  -0.88533039
## 647  -2.00687865
## 653   1.47078323
## 657  -0.73639154
## 664  -2.03263665
## 682  -2.31336373
## 684   2.24327191
## 690   0.75616119
## 696  -0.17046417
## 713   0.95774272
## 718  -1.20961117
## 725   1.85564158
## 732   0.62410689
## 735  -2.36523343
## 743   0.39978796
## 754   2.59071348
## 767   0.45239671
## 768  -0.47151027
## 773  -1.56351739
## 775   2.29724127
## 781   0.10408476
## 805  -2.39288540
## 806   1.16435710
## 809  -1.85917459
## 823   1.24200987
## 835  -1.56086975
## 863  -0.95428661
## 864  -1.70193001
## 865   1.67637117
## 867   2.13394595
## 870   1.74265443
## 874  -0.50695899
## 883   0.03769852
## 891  -2.11443063
## 900  -1.11501290
## 906  -2.05006546
## 912  -0.67675110
## 919   1.73518957
## 926  -1.01910656
## 930   1.71549826
## 933   2.19109159
## 936  -3.56509947
## 939   2.04758136
## 954  -0.26810190
## 966   0.90355143
## 995   2.04348841
## 997  -2.07410216
## 1003  2.05634307
## 1013 -1.26479294
## 1015 -1.52673963
## 1017  1.38181435
## 1018  0.21196643
## 1022  0.91097571
## 1024  1.47760055
## 1028  2.07751082
## 1044  1.61972044
## 1048 -0.96200804
## 1053 -0.11584779
## 1059  2.30742434
## 1076 -1.56883319
## 1078 -1.45541371
## 1080 -1.09232612
## 1083  1.36548704
## 1086 -0.85024814
## 1096  0.75861751
## 1099 -2.41487642
## 1134  2.25061838
## 1153  0.01410041
## 1162  0.74778051
## 1163 -2.49229995
## 1182  0.77545035
## 1186 -2.10102814
## 1217 -0.05143321
## 1223 -0.34162089
## 1225  1.57856179
## 1232 -1.57437095
## 1243 -1.19322376
## 1250 -1.77466090
## 1256 -1.78911218
## 1272 -0.13693750
## 1275 -2.56214323
## 1277 -0.51734606
## 1279  2.48208906
## 1287 -0.89545809
## 1289  2.11516427
## 1301  2.04929820
## 1330  0.69881031
## 1339  1.65073429
## 1349  1.53174019
## 1361  0.22153162
## 1367  1.61919482
## 1370  2.50434905
## 1376  1.72816793
## 1378 -1.22521328
## 1383 -0.98768440
## 1391  1.44732034
## 1399 -0.75688304
## 1413 -2.54294219
## 1419 -2.02940598
## 1422 -1.21921081
## 1427  2.10719364
## 1428 -1.02797597

The confusionmatrix function of the biotools package computes the confusion matrix of the classifier on the test set.

pred <- predict(object = modelo_lda, newdata = datos_test)
confusionmatrix(datos_test$PriceCategory, pred$class)
##           new Cheap new Expensive
## Cheap           100             2
## Expensive        10            92
# Classification error percentage on the test set
test_error <- mean(datos_test$PriceCategory != pred$class) * 100
paste("test_error=", test_error, "%")
## [1] "test_error= 5.88235294117647 %"

The correct classification rate is 94.11765%.

From a geometric point of view, linear discriminant analysis separates the feature space with a hyperplane (a straight line when there are only two predictors).
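
One way to visualize this separation on the test set is to plot the distribution of the discriminant scores by class (a sketch reusing the pred object computed above):

# Sketch: distribution of the LD1 scores by class in the test set
ggplot(data.frame(LD1 = pred$x[, 1], Class = datos_test$PriceCategory),
       aes(x = LD1, fill = Class)) +
  geom_density(alpha = 0.5) +
  labs(title = "LD1 scores by class", x = "LD1", y = "Density")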

3.6.1 ROC curve

pred <- predict(object = modelo_lda, newdata = datos_test, type = "response")

# Compute the ROC curve
roc_curve <- roc(datos_test$PriceCategory, as.numeric(pred$posterior[,2]), levels = c("Cheap", "Expensive"))

# Plot ROC curve
plot(roc_curve, main="ROC curve for LDA", col="#1c61b6")

# Add diagonal line
abline(0, 1, lty=2, col = "red")

auc(roc_curve)
## Area under the curve: 0.9865

3.6.2 Validity measures

# Compute confusion Matrix
conf_mat <- confusionMatrix(pred$class, datos_test$PriceCategory)

# Extract values TP, FP, TN, FN, taking "Expensive" as the positive class
# (in conf_mat$table the rows are the predictions and the columns the reference)
TP <- conf_mat$table[2,2]  # predicted Expensive, actually Expensive
FP <- conf_mat$table[2,1]  # predicted Expensive, actually Cheap
TN <- conf_mat$table[1,1]  # predicted Cheap, actually Cheap
FN <- conf_mat$table[1,2]  # predicted Cheap, actually Expensive

# Compute metrics
PPV <- TP / (TP + FP)  # Positive Predictive Value
TPR <- TP / (TP + FN)  # True Positive Rate or Sensitivity
TNR <- TN / (TN + FP)  # True Negative Rate or Specificity
F1 <- 2 * PPV * TPR / (PPV + TPR)  # F1 Score
G_mean <- sqrt(TPR * TNR)  # G-mean
Accuracy <- (TP + TN) / (TP + FP + FN + TN)  # Accuracy

# Compute AUC
roc_curve <- roc(datos_test$PriceCategory, as.numeric(pred$posterior[,2]), levels = c("Cheap", "Expensive"))
AUC <- auc(roc_curve)

# Print metrics
cat("PPV:", PPV, "\n",
    "TPR:", TPR, "\n",
    "TNR:", TNR, "\n",
    "F1-score:", F1, "\n",
    "G-mean:", G_mean, "\n",
    "Accuracy:", Accuracy, "\n",
    "AUC:", AUC, "\n")
## PPV: 0.9787234 
##  TPR: 0.9019608 
##  TNR: 0.9803922 
##  F1-score: 0.9387755 
##  G-mean: 0.9403591 
##  Accuracy: 0.9411765 
##  AUC: 0.9865436

3.7 Classifier using quadratic discriminant analysis

Although the assumption of multivariate normality is not verified, and taking into account that the variances are not homogeneous, a quadratic discriminant model is fitted because it is relatively robust to the lack of this assumption; nevertheless, the possibility of obtaining unexpected results must be kept in mind.
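
The claim about non-homogeneous variances can be checked with Box's M test; a sketch using the boxM function of the biotools package (already loaded above for confusionmatrix):

# Sketch: Box's M test of homogeneity of covariance matrices across the groups
boxM(datos_train[, c("LotFrontage", "LotArea", "OverallQual", "OverallCond",
                     "YearBuilt", "GrLivArea", "X1stFlrSF", "X2ndFlrSF", "YrSold")],
     datos_train$PriceCategory)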

The qda function of the MASS package fits this classifier.

modelo_qda <- qda(PriceCategory ~ LotFrontage + LotArea + OverallQual + OverallCond + YearBuilt + GrLivArea + X1stFlrSF + X2ndFlrSF + YrSold ,data = datos_train)
modelo_qda
## Call:
## qda(PriceCategory ~ LotFrontage + LotArea + OverallQual + OverallCond + 
##     YearBuilt + GrLivArea + X1stFlrSF + X2ndFlrSF + YrSold, data = datos_train)
## 
## Prior probabilities of groups:
##     Cheap Expensive 
##       0.5       0.5 
## 
## Group means:
##           LotFrontage   LotArea OverallQual OverallCond YearBuilt GrLivArea
## Cheap        67.79471  8880.846    5.276961    5.664216  1959.056  1234.218
## Expensive    73.35384 10205.684    7.132353    5.235294  1995.265  1758.924
##           X1stFlrSF X2ndFlrSF   YrSold
## Cheap      1054.777  172.2819 2007.782
## Expensive  1298.103  459.8799 2007.811

We haven’t added the categorical variables to the model because they produced an error (rank deficiency in group Cheap).

The output of this object shows the prior probabilities of each group, in this case 0.5 for each, and the means of each regressor per group.

Once the classifier is built, we can classify new data based on its measurements by simply calling the predict function. We are going to classify all the observations in the test dataset.

nuevas_observaciones <- datos_test
predict(object = modelo_qda, newdata = nuevas_observaciones)
## $class
##   [1] Expensive Expensive Cheap     Cheap     Cheap     Cheap     Expensive
##   [8] Cheap     Cheap     Cheap     Cheap     Cheap     Expensive Cheap    
##  [15] Expensive Cheap     Expensive Cheap     Cheap     Expensive Expensive
##  [22] Expensive Cheap     Cheap     Expensive Expensive Expensive Cheap    
##  [29] Expensive Expensive Cheap     Expensive Cheap     Cheap     Expensive
##  [36] Expensive Expensive Expensive Cheap     Expensive Expensive Cheap    
##  [43] Cheap     Cheap     Cheap     Cheap     Expensive Cheap     Expensive
##  [50] Expensive Expensive Cheap     Expensive Cheap     Cheap     Expensive
##  [57] Expensive Cheap     Expensive Expensive Expensive Cheap     Expensive
##  [64] Expensive Cheap     Cheap     Cheap     Expensive Expensive Expensive
##  [71] Cheap     Expensive Cheap     Expensive Cheap     Expensive Cheap    
##  [78] Expensive Expensive Cheap     Cheap     Cheap     Cheap     Expensive
##  [85] Cheap     Cheap     Cheap     Cheap     Expensive Expensive Expensive
##  [92] Cheap     Cheap     Expensive Expensive Expensive Cheap     Expensive
##  [99] Expensive Cheap     Expensive Expensive Expensive Cheap     Expensive
## [106] Cheap     Cheap     Cheap     Expensive Expensive Expensive Expensive
## [113] Cheap     Expensive Expensive Cheap     Expensive Expensive Expensive
## [120] Cheap     Cheap     Expensive Expensive Cheap     Expensive Cheap    
## [127] Expensive Cheap     Cheap     Cheap     Expensive Expensive Expensive
## [134] Cheap     Expensive Cheap     Cheap     Cheap     Cheap     Expensive
## [141] Cheap     Expensive Expensive Cheap     Expensive Cheap     Expensive
## [148] Expensive Cheap     Expensive Cheap     Cheap     Expensive Expensive
## [155] Expensive Expensive Expensive Expensive Cheap     Expensive Expensive
## [162] Cheap     Cheap     Cheap     Expensive Cheap     Expensive Cheap    
## [169] Expensive Expensive Expensive Cheap     Expensive Cheap     Expensive
## [176] Cheap     Expensive Cheap     Expensive Cheap     Cheap     Cheap    
## [183] Cheap     Expensive Expensive Cheap     Expensive Expensive Expensive
## [190] Expensive Expensive Cheap     Expensive Expensive Expensive Cheap    
## [197] Cheap     Expensive Cheap     Cheap     Cheap     Cheap     Expensive
## [204] Cheap    
## Levels: Cheap Expensive
## 
## $posterior
##             Cheap    Expensive
## 14   1.598651e-02 9.840135e-01
## 29   6.185334e-02 9.381467e-01
## 41   9.618183e-01 3.818166e-02
## 45   9.597117e-01 4.028826e-02
## 53   9.996636e-01 3.363617e-04
## 55   9.969865e-01 3.013546e-03
## 58   1.072621e-03 9.989274e-01
## 72   9.755882e-01 2.441185e-02
## 74   9.926985e-01 7.301516e-03
## 77   9.961080e-01 3.892016e-03
## 78   9.996536e-01 3.463523e-04
## 84   9.991322e-01 8.678486e-04
## 85   1.758521e-02 9.824148e-01
## 98   9.993453e-01 6.547191e-04
## 102  4.098831e-02 9.590117e-01
## 103  6.448357e-01 3.551643e-01
## 104  5.150332e-02 9.484967e-01
## 108  9.999838e-01 1.619422e-05
## 111  9.999890e-01 1.102633e-05
## 131  1.214795e-03 9.987852e-01
## 134  4.243563e-03 9.957564e-01
## 142  4.547363e-03 9.954526e-01
## 143  9.999998e-01 1.624727e-07
## 157  9.935918e-01 6.408201e-03
## 163  8.492884e-03 9.915071e-01
## 168  1.437689e-04 9.998562e-01
## 187  1.009656e-01 8.990344e-01
## 189  8.583566e-01 1.416434e-01
## 193  1.092336e-02 9.890766e-01
## 200  2.016288e-04 9.997984e-01
## 205  9.928382e-01 7.161788e-03
## 213  3.625303e-04 9.996375e-01
## 214  5.554602e-01 4.445398e-01
## 216  9.353098e-01 6.469019e-02
## 220  3.035077e-03 9.969649e-01
## 221  1.228220e-02 9.877178e-01
## 239  3.077326e-04 9.996923e-01
## 241  3.595329e-04 9.996405e-01
## 260  9.997670e-01 2.330396e-04
## 271  1.981789e-04 9.998018e-01
## 274  4.393129e-01 5.606871e-01
## 275  9.993106e-01 6.893748e-04
## 276  9.999430e-01 5.697134e-05
## 289  9.982319e-01 1.768139e-03
## 290  9.999873e-01 1.271531e-05
## 295  9.836750e-01 1.632497e-02
## 299  5.945010e-02 9.405499e-01
## 304  9.966317e-01 3.368294e-03
## 317  2.699865e-03 9.973001e-01
## 337  1.983135e-06 9.999980e-01
## 339  3.219318e-01 6.780682e-01
## 353  9.983393e-01 1.660654e-03
## 356  4.951785e-01 5.048215e-01
## 362  9.999694e-01 3.061819e-05
## 373  8.305816e-01 1.694184e-01
## 375  1.730682e-04 9.998269e-01
## 386  2.919351e-04 9.997081e-01
## 397  9.899145e-01 1.008555e-02
## 410  2.953204e-05 9.999705e-01
## 415  2.766892e-04 9.997233e-01
## 424  2.508788e-05 9.999749e-01
## 425  9.976045e-01 2.395498e-03
## 436  1.528628e-03 9.984714e-01
## 445  9.873844e-04 9.990126e-01
## 446  9.496542e-01 5.034577e-02
## 450  8.565866e-01 1.434134e-01
## 467  7.168933e-01 2.831067e-01
## 470  6.254450e-03 9.937455e-01
## 471  5.809803e-03 9.941902e-01
## 472  4.004270e-03 9.959957e-01
## 476  9.849711e-01 1.502888e-02
## 479  1.840944e-04 9.998159e-01
## 486  9.928581e-01 7.141913e-03
## 516  3.390832e-08 1.000000e+00
## 523  9.765542e-01 2.344580e-02
## 526  2.717020e-02 9.728298e-01
## 533  9.980792e-01 1.920827e-03
## 542  6.957149e-06 9.999930e-01
## 544  7.135313e-02 9.286469e-01
## 549  9.931445e-01 6.855472e-03
## 551  5.829181e-01 4.170819e-01
## 561  8.749257e-01 1.250743e-01
## 562  9.246905e-01 7.530952e-02
## 573  5.329391e-04 9.994671e-01
## 577  9.999536e-01 4.635774e-05
## 587  9.999883e-01 1.168584e-05
## 588  9.642178e-01 3.578216e-02
## 594  5.383447e-01 4.616553e-01
## 596  1.227445e-04 9.998773e-01
## 599  3.660727e-01 6.339273e-01
## 605  1.323418e-02 9.867658e-01
## 610  9.990382e-01 9.617520e-04
## 616  9.916315e-01 8.368457e-03
## 620  5.550637e-05 9.999445e-01
## 622  3.740128e-02 9.625987e-01
## 625  1.956842e-01 8.043158e-01
## 626  9.270346e-01 7.296541e-02
## 629  5.691767e-02 9.430823e-01
## 632  3.080660e-05 9.999692e-01
## 634  9.935470e-01 6.452960e-03
## 641  2.651247e-04 9.997349e-01
## 642  1.028925e-03 9.989711e-01
## 644  3.890896e-01 6.109104e-01
## 647  9.998947e-01 1.053373e-04
## 653  3.085258e-03 9.969147e-01
## 657  9.883382e-01 1.166176e-02
## 664  9.959742e-01 4.025809e-03
## 682  9.999993e-01 6.734452e-07
## 684  3.032820e-05 9.999697e-01
## 690  6.963912e-02 9.303609e-01
## 696  2.122048e-01 7.877952e-01
## 713  5.057171e-03 9.949428e-01
## 718  8.727371e-01 1.272629e-01
## 725  7.302186e-06 9.999927e-01
## 732  1.029524e-01 8.970476e-01
## 735  9.972914e-01 2.708645e-03
## 743  2.001277e-02 9.799872e-01
## 754  4.674900e-05 9.999533e-01
## 767  9.648524e-03 9.903515e-01
## 768  7.017834e-01 2.982166e-01
## 773  9.218587e-01 7.814129e-02
## 775  2.648684e-05 9.999735e-01
## 781  4.746499e-02 9.525350e-01
## 805  9.996526e-01 3.473942e-04
## 806  2.121184e-02 9.787882e-01
## 809  9.956607e-01 4.339348e-03
## 823  6.637024e-04 9.993363e-01
## 835  9.603370e-01 3.966300e-02
## 863  7.787805e-01 2.212195e-01
## 864  9.985762e-01 1.423775e-03
## 865  1.590206e-02 9.840979e-01
## 867  2.695107e-04 9.997305e-01
## 870  1.321993e-03 9.986780e-01
## 874  1.000000e+00 6.070884e-30
## 883  1.642622e-02 9.835738e-01
## 891  9.978902e-01 2.109818e-03
## 900  9.700519e-01 2.994807e-02
## 906  9.998208e-01 1.791535e-04
## 912  9.812232e-01 1.877677e-02
## 919  1.447299e-05 9.999855e-01
## 926  7.093340e-01 2.906660e-01
## 930  1.953564e-05 9.999805e-01
## 933  4.071608e-06 9.999959e-01
## 936  1.000000e+00 3.115341e-09
## 939  2.439865e-03 9.975601e-01
## 954  9.979050e-01 2.094981e-03
## 966  1.302896e-02 9.869710e-01
## 995  3.864569e-07 9.999996e-01
## 997  9.650521e-01 3.494788e-02
## 1003 4.110950e-04 9.995889e-01
## 1013 9.986226e-01 1.377412e-03
## 1015 9.996176e-01 3.824138e-04
## 1017 9.398848e-03 9.906012e-01
## 1018 2.671632e-02 9.732837e-01
## 1022 4.024673e-02 9.597533e-01
## 1024 4.414345e-04 9.995586e-01
## 1028 2.771546e-04 9.997228e-01
## 1044 9.842674e-04 9.990157e-01
## 1048 7.704019e-01 2.295981e-01
## 1053 2.761629e-02 9.723837e-01
## 1059 2.260297e-06 9.999977e-01
## 1076 8.321756e-01 1.678244e-01
## 1078 9.825305e-01 1.746945e-02
## 1080 8.680819e-01 1.319181e-01
## 1083 1.212824e-02 9.878718e-01
## 1086 7.806956e-01 2.193044e-01
## 1096 2.359656e-01 7.640344e-01
## 1099 9.980734e-01 1.926609e-03
## 1134 3.040128e-04 9.996960e-01
## 1153 1.003761e-01 8.996239e-01
## 1162 1.532450e-01 8.467550e-01
## 1163 9.981806e-01 1.819409e-03
## 1182 1.694997e-04 9.998305e-01
## 1186 9.994157e-01 5.843338e-04
## 1217 1.647274e-01 8.352726e-01
## 1223 6.578686e-01 3.421314e-01
## 1225 8.941752e-04 9.991058e-01
## 1232 9.277972e-01 7.220283e-02
## 1243 3.966838e-01 6.033162e-01
## 1250 9.956461e-01 4.353921e-03
## 1256 9.999212e-01 7.884575e-05
## 1272 8.214574e-01 1.785426e-01
## 1275 1.000000e+00 1.620504e-08
## 1277 1.115804e-01 8.884196e-01
## 1279 1.520985e-04 9.998479e-01
## 1287 9.640120e-01 3.598799e-02
## 1289 3.582491e-05 9.999642e-01
## 1301 3.078849e-04 9.996921e-01
## 1330 7.085867e-03 9.929141e-01
## 1339 2.295897e-04 9.997704e-01
## 1349 5.796376e-03 9.942036e-01
## 1361 8.950806e-01 1.049194e-01
## 1367 1.000122e-03 9.989999e-01
## 1370 3.290683e-05 9.999671e-01
## 1376 7.784750e-04 9.992215e-01
## 1378 9.998202e-01 1.797535e-04
## 1383 9.999636e-01 3.643660e-05
## 1391 1.088589e-02 9.891141e-01
## 1399 9.999990e-01 9.822073e-07
## 1413 9.999575e-01 4.246619e-05
## 1419 9.962932e-01 3.706818e-03
## 1422 6.991269e-01 3.008731e-01
## 1427 6.509543e-04 9.993490e-01
## 1428 9.307370e-01 6.926301e-02

The confusionmatrix function of the biotools package computes the confusion matrix of the classifier on the test set.

pred <- predict(object = modelo_qda, newdata = datos_test)
confusionmatrix(datos_test$PriceCategory, pred$class)
##           new Cheap new Expensive
## Cheap            91            11
## Expensive         6            96
# Classification error percentage on the test set
test_error <- mean(datos_test$PriceCategory != pred$class) * 100
paste("test_error=", test_error, "%")
## [1] "test_error= 8.33333333333333 %"

In this case the correct classification rate is 91.66667%.

3.7.1 ROC curve

pred <- predict(object = modelo_qda, newdata = datos_test, type = "response")

# Compute ROC curve
roc_curve <- roc(datos_test$PriceCategory, as.numeric(pred$posterior[,2]), levels = c("Cheap", "Expensive"))

# Plot ROC curve
plot(roc_curve, main="ROC curve for QDA", col="#1c61b6")

# Add diagonal line
abline(0, 1, lty=2, col = "red")

auc(roc_curve)
## Area under the curve: 0.9648

3.7.2 Validity measures

# Compute confusion Matrix
conf_mat <- confusionMatrix(pred$class, datos_test$PriceCategory)

# Extract values TP, FP, TN, FN, taking "Expensive" as the positive class
# (in conf_mat$table the rows are the predictions and the columns the reference)
TP <- conf_mat$table[2,2]  # predicted Expensive, actually Expensive
FP <- conf_mat$table[2,1]  # predicted Expensive, actually Cheap
TN <- conf_mat$table[1,1]  # predicted Cheap, actually Cheap
FN <- conf_mat$table[1,2]  # predicted Cheap, actually Expensive

# Compute metrics
PPV <- TP / (TP + FP)  # Positive Predictive Value
TPR <- TP / (TP + FN)  # True Positive Rate or Sensitivity
TNR <- TN / (TN + FP)  # True Negative Rate or Specificity
F1 <- 2 * PPV * TPR / (PPV + TPR)  # F1 Score
G_mean <- sqrt(TPR * TNR)  # G-mean
Accuracy <- (TP + TN) / (TP + FP + FN + TN)  # Accuracy

# Compute AUC
roc_curve <- roc(datos_test$PriceCategory, as.numeric(pred$posterior[,2]), levels = c("Cheap", "Expensive"))
AUC <- auc(roc_curve)

# Print metrics
cat("PPV:", PPV, "\n",
    "TPR:", TPR, "\n",
    "TNR:", TNR, "\n",
    "F1-score:", F1, "\n",
    "G-mean:", G_mean, "\n",
    "Accuracy:", Accuracy, "\n",
    "AUC:", AUC, "\n")
## PPV: 0.8971963 
##  TPR: 0.9411765 
##  TNR: 0.8921569 
##  F1-score: 0.9186603 
##  G-mean: 0.9163389 
##  Accuracy: 0.9166667 
##  AUC: 0.9648212

3.8 Cluster Analysis

distance<- get_dist(numeric_data)
fviz_dist(distance, gradient = list(low ="yellow", mid = "white", high = "brown"))
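
Since the variables are on very different scales (LotArea, for instance, is orders of magnitude larger than OverallQual), it can also be informative to compute the distances on standardized data (a sketch):

# Sketch: distance matrix on standardized variables
distance_scaled <- get_dist(scale(numeric_data))
fviz_dist(distance_scaled, gradient = list(low ="yellow", mid = "white", high = "brown"))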

3.8.1 Hierarchical clustering: Ward’s method

Hierarchical clustering seeks a hierarchy based on the closeness or similarity of the data according to the chosen distance. In the agglomerative case, we start from groups formed by the closest observations; the next closest pairs are then merged, and groups are built in an ascending manner. This construction can be visualized by means of a dendrogram.

The dendrogram below illustrates how the groups are defined by its vertical lines, and the optimal number of groups can be estimated from this same graph.

dendrogram <- hclust(dist(numeric_data, method = 'euclidean'), method = 'ward.D')
ggdendrogram(dendrogram, rotate = FALSE, labels = FALSE, theme_dendro = TRUE) + 
  labs(title = "Dendrogram")

3.8.2 Non-hierarchical clustering: K-means algorithm

k2 <- kmeans(numeric_data, centers = 2, nstart = 25)

# Displaying all the fields of the object k2
str(k2)
## List of 9
##  $ cluster     : Named int [1:1020] 2 1 2 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:1020] "1" "3" "4" "5" ...
##  $ centers     : num [1:2, 1:10] 43.2 54.4 77.6 65.3 11899.6 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:10] "MSSubClass" "LotFrontage" "LotArea" "OverallQual" ...
##  $ totss       : num 7.83e+09
##  $ withinss    : num [1:2] 1.5e+09 1.8e+09
##  $ tot.withinss: num 3.3e+09
##  $ betweenss   : num 4.53e+09
##  $ size        : int [1:2] 454 566
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

The output provided by the kmeans function is a list of information, including the following:

  • cluster: a vector of integers, from 1 to K (K=2 in this case), indicating the cluster to which each observation has been assigned.
  • centers: a matrix of cluster centers.
  • totss: the total sum of squares.
  • withinss: vector of within-cluster sums of squares (one component per cluster).
  • tot.withinss: total within-cluster sum of squares, i.e. sum(withinss).
  • betweenss: the between-group sum of squares, i.e. totss - tot.withinss.
  • size: the number of observations in each cluster.
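
A simple way to relate these two clusters to the price categories defined earlier is a cross-tabulation (a sketch):

# Sketch: compare the k-means clusters with the price categories
table(k2$cluster, data$PriceCategory)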

Displaying the variable k2 shows how the observations are assigned to the 2 clusters.

k2
## K-means clustering with 2 clusters of sizes 454, 566
## 
## Cluster means:
##   MSSubClass LotFrontage   LotArea OverallQual OverallCond YearBuilt GrLivArea
## 1   43.24890    77.63516 11899.641    6.550661    5.407489  1981.295  1671.586
## 2   54.37279    65.31116  7678.396    5.948763    5.489399  1973.949  1345.067
##   X1stFlrSF X2ndFlrSF   YrSold
## 1  1278.601  389.3062 2007.744
## 2  1093.965  247.8039 2007.850
## 
## Clustering vector:
##    1    3    4    5    6    7    8   11   12   13   14   15   17   18   19   20 
##    2    1    2    1    1    1    1    1    1    1    1    1    1    1    1    2 
##   21   22   23   24   26   27   28   29   31   32   33   34   35   36   38   39 
##    1    2    2    2    1    2    1    1    2    2    1    1    2    1    2    2 
##   ... (clustering vector continues: each observation is assigned to cluster 1 or 2; full listing omitted for brevity) ...
## 
## Within cluster sum of squares by cluster:
## [1] 1497003869 1798607593
##  (between_SS / total_SS =  57.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Visualize the two-cluster K-means partition (projected onto the first two principal components)
fviz_cluster(k2, data = numeric_data)

# Elbow method: total within-cluster sum of squares for increasing numbers of clusters
set.seed(123)
fviz_nbclust(numeric_data, kmeans, method = "wss")

# Average silhouette width for increasing numbers of clusters
set.seed(123)
fviz_nbclust(numeric_data, kmeans, method = "silhouette")

set.seed(123)
# Gap statistic over k = 1..10, using 50 bootstrap reference data sets
gap_stat <- clusGap(numeric_data, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations
fviz_gap_stat(gap_stat)
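
The "did not converge in 10 iterations" warnings arise because kmeans stops at its default iter.max = 10 when clusGap re-runs it on each bootstrap sample. A possible remedy, sketched here as a hypothetical variant rather than part of the original analysis, is to pass a small wrapper that raises the iteration limit.

# Hypothetical wrapper: same clustering, but with a higher iteration limit so
# kmeans converges before the warning threshold is reached
kmeans_iter50 <- function(x, k) kmeans(x, k, iter.max = 50, nstart = 25)

set.seed(123)
gap_stat_50 <- clusGap(numeric_data, FUN = kmeans_iter50, K.max = 10, B = 50)
fviz_gap_stat(gap_stat_50)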

# Add cluster's labels to the original dataset
data$Cluster <- k2$cluster

# Create a table for comparing clusters and PriceCategory
table(data$PriceCategory, data$Cluster)
##            
##               1   2
##   Cheap     160 350
##   Expensive 294 216

The clusters and the price categories coincide for the majority of observations: matching cluster 1 with Expensive and cluster 2 with Cheap, 644 of the 1020 observations in the table (about 63%) fall in the agreeing cells.
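
As a rough check of that agreement, the proportion of agreeing observations can be computed directly from the contingency table above (a minimal sketch that assumes the best label matching, cluster 1 to Expensive and cluster 2 to Cheap).

# Agreement rate under the best label matching: cluster 1 ~ Expensive, cluster 2 ~ Cheap
confusion <- table(data$PriceCategory, data$Cluster)
(confusion["Cheap", "2"] + confusion["Expensive", "1"]) / sum(confusion)
# roughly 0.63 with the table above (644 / 1020)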

adjusted_rand_index <- adjustedRandIndex(data$PriceCategory, data$Cluster)
print(adjusted_rand_index)
## [1] 0.06813152

The ARI value is approximately 0.068. The Adjusted Rand Index (ARI) measures the similarity between two cluster assignments (here, the PriceCategory variable and the clusters generated by K-means), corrected for the agreement expected by chance. ARI values range from -1 to 1:

- ARI = 1 indicates a perfect match between the two groupings.
- ARI = 0 indicates a match no better than random chance.
- ARI < 0 indicates a match worse than random chance.

In this case, a value of about 0.07 suggests that the K-means clusters agree with the PriceCategory variable only slightly better than random chance. This is not a strong match, indicating that the clustering does not align meaningfully with the existing price categories.
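
To make the ARI scale concrete, the following toy illustration (with hypothetical labels unrelated to the Ames data, and assuming the mclust package used above is loaded) shows the two reference points of the scale: identical partitions score exactly 1, while randomly shuffled labels score close to 0.

# Toy example of the ARI scale (hypothetical labels, not from the dataset)
set.seed(123)
toy_labels <- rep(1:2, each = 300)                 # a reference partition of 600 items
adjustedRandIndex(toy_labels, toy_labels)          # exactly 1: identical partitions
adjustedRandIndex(toy_labels, sample(toy_labels))  # close to 0: random permutation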